A SYSTEM AND METHOD FOR MOLECULE SELECTION USING EXTENDED TARGET SHAPE



This application claims the benefit under 35 U.S.C. § 119(e) of provisional
application number 60/129,915, filed April 19, 1999, which is hereby incorporated by
reference in its entirety.



1. INTRODUCTION 

The present invention includes a means to obtain one or more initial
candidate molecules that are at least somewhat dissimilar to a chosen target molecule or
"targetshape", to produce initial variants of the initial candidate molecules, and to screen or
select candidates from among those variants for molecules that are either more or less
similar to the targetshape.

2. BACKGROUND OF THE INVENTION

The new field of applied molecular evolution, or molecular diversity, is rapidly
becoming of central importance in the generation of useful molecules for drugs, vaccines,
biosensors, catalysts, and so forth. Molecular diversity is based on generating very large
libraries of candidate compounds, up to 10^15 for quasirandom single stranded RNA or DNA
sequences, 10^13 for phage displayed polypeptides, and into the millions for libraries of small
molecules. These libraries are then screened or subjected to selection in order to find useful
candidate compounds.

Typical screening procedures, as specified, e.g., in U.S. Patent No. 5,824,514 to S.
Kauffman and Balivet, incorporated herein by reference in its entirety, are based on the use,
for example, of a ligand as the screen, and then screening for a novel molecule able to bind
the ligand. For example, the ligand might be the estrogen receptor, and phage display
libraries are searched for novel peptides or polypeptides able to bind the estrogen receptor.
Any such peptide or polypeptide is a candidate drug which might mimic, modulate, agonize,
or antagonize the action of estrogen. In an equivalent procedure, the SELEX procedure, an
RNA molecule able to bind a target is selected. Typically, once an initial set of candidate
compounds is located, the candidates are, in one form or another, "amplified" or replicated
and then subjected to successive binding and amplification cycles in order to winnow down
to good binding candidates. U.S. Patent No. 5,824,514, Patent No. WO9424314 to S.
Kauffman and J. Rebek, and S. Brenner et al., Proc. Natl. Acad. Sci. USA, 1992, 89:5381, are
hereby incorporated in their entireties as non-limiting examples of generating,
characterizing, and screening molecular diversity libraries.

The early stage of the drug development process is known in the art as
"lead discovery," and a successful candidate is referred to as a "lead."
Three gaps in the current technology are becoming increasingly apparent:

1) Consider screening a population of molecules for candidates that mimic the
shape and/or structure of some target compound. Present screening or selection techniques
primarily identify candidates that are very close in shape and/or function to the target
compound, e.g., candidates that efficiently bind to receptors or antibodies of the target
compound. Therefore, molecules that are "close" in shape and/or structure to a viable
mimic for the target molecule, but not similar enough to bind a target receptor or
target antibody efficiently, remain undetected by current screening procedures.
2) A second, pressing problem in the drug discovery field is referred to herein
as "The Multiple Target Problem". Typical drug compounds must satisfy a number of
criteria. For example, a compound may be sought that selectively binds a particular
receptor in preference to one or more different receptors, crosses cell membranes and
nuclear membranes, survives oral ingestion, does not cross the blood brain barrier, shows
good renal clearance, and does not exhibit a variety of cross binding properties to other
molecules or sites that would cause further side effects. The process of further developing
or modifying a lead compound to meet such criteria is often referred to as lead optimization.
Lead optimization is costly and labor intensive. For example, if the total cost of
development of a drug, including clinical trials, is on the order of 200 to 300 million
dollars, then the typical cost of lead discovery might be on the order of 1 million dollars,
while lead optimization may typically cost 20 to 40 million dollars. That is, lead
optimization is a far more expensive and complex step in the drug development process
than lead discovery. Indeed, the very field of molecular diversity is making the discovery
of good leads ever easier, hence commoditizing the discovery of drug leads.

Problems 1 and 2 above are related: Solving the multiple target problem, in general,
requires finding a set of initial candidate molecules or leads able to meet a number of
different criteria. One would expect to find initial candidates that were only slightly able to
perform several or all of the tasks, and then to optimize one or more candidates until optimum
(perhaps compromise) candidates were obtained. Thus, the capacity to find candidates to
fulfill multiple tasks simultaneously is typically going to require the ability to locate and
optimize molecules which are initially quite poor at all or most of the tasks.
3) In order to solve the multiple target problem, it will be necessary to generate
ever "improved" libraries of candidate molecules "spotted into" the proper region of
molecular shape space. Thus, suppose one wished to find a molecule able to bind the
estrogen receptor and also able to bind some other receptor, X. Initial candidates might be
poor at both tasks, so poor that one could not obtain binding either to the X or the estrogen
receptor. It is an object of the present invention to detect molecules that are modestly close
to being able to accomplish both tasks, i.e., bind both the estrogen and X receptors. Then, the
initial screen will have identified a good region of shape space where candidates to solve
both tasks are located. Then, further screening or selection would be enabled by the
capacity to generate a new library of candidate molecules in the vicinity of this good region
of shape space and select or screen for candidates with improved capacities to accomplish
both tasks. A succession of such steps, generating and testing new libraries directly or in
part computationally, would then constitute a lead optimization procedure with respect to
these two tasks.

In practice, traditional applications of combinatorial chemistry and/or molecular
diversity have faced an additional obstacle: with respect to combinatorial chemistry, it has
proved difficult to deconvolve highly diverse libraries of small organic molecules, and
pharmaceutical companies are moving in the direction of attempting to make focused
libraries, typically built by derivatization of a common core molecular structure at many
sites, each in many ways. In short, screening is tending to move from high throughput - high
diversity libraries, to restricted or low diversity libraries, even batches of ten or fewer
molecular species. Thus, there is a need in the art for enhanced means to generate focused
libraries of high diversity, but localized to specific regions of shape space.

Prior to the onset of molecular diversity, combinatorial chemistry, and high 
throughput screening, rational drug design was the procedure of choice to construct one or 
more molecular species to test as potential drugs. Here the aim has been to understand the 
target "receptor" site, such as, for example, the structure, conformation, pose, or epitope of
the site and rationally design the candidate drugs to bind to the target site. 

Without the guidance of initial candidate molecules to assist the rational design
process, however, rational drug design can be a labor-intensive, time-consuming process,
which may not readily arrive at an ideal drug design. Thus, there is a need in the art to find
a means to marry molecular diversity - combinatorial chemistry with rational drug design to
overcome the problems above, such that initial and optimized leads can be achieved at
lower cost and higher efficiency. The present invention meets these needs.
3. SUMMARY OF THE INVENTION

The present invention preferably provides a method for predicting a property of a
molecule comprising the steps of:

obtaining an initial odd set of molecules that bind at least one molecule
belonging to an origin set of molecules;

obtaining an even set of molecules that bind at least one molecule belonging
to said odd set of molecules;

selecting a training set comprising a subset of the even set of molecules;

determining a conformation for each of the training set molecules;

constructing a model for predicting a predetermined property of at least one new
molecule not assigned to the subset of the even set of molecules, wherein the new
molecule has a new conformation and the model includes the conformation of at least some
of the training set molecules; and

predicting the predetermined property of the new molecule.
In a preferred embodiment, the method further comprises selecting a training set
comprising a subset from the odd set of molecules and repeating the determining,
constructing, and predicting steps wherein the model further comprises the conformation for
each molecule in the odd subset.

In another embodiment, the method comprises predicting a predetermined property
of at least one molecule assigned to one of the subsets, conditionally modifying the model
in response to a difference between said predicted predetermined property and an empirical
estimate of said predicted property, and repeating said predicting and conditionally
modifying steps until said difference reaches a predetermined value.
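
The predict, compare, and conditionally modify cycle described above can be sketched as follows. This is only an illustrative sketch, not the method of the invention: the linear model, the learning rate, the tolerance, and the toy training data are all assumptions introduced for the example.

```python
# Hypothetical sketch: repeat "predict" and "conditionally modify" until the
# difference between prediction and measurement reaches a chosen value.

def predict(weights, features):
    """Predicted property as a weighted sum of conformation descriptors."""
    return sum(w * x for w, x in zip(weights, features))

def refine(weights, training, tolerance=1e-3, rate=0.1, max_rounds=10_000):
    """Return (weights, worst_error) after iterative refinement."""
    weights = list(weights)
    worst = float("inf")
    for _ in range(max_rounds):
        worst = 0.0
        for features, measured in training:
            error = predict(weights, features) - measured
            worst = max(worst, abs(error))
            if abs(error) > tolerance:  # modify only when the difference is too large
                for i, x in enumerate(features):
                    weights[i] -= rate * error * x
        if worst <= tolerance:  # no molecule needed a modification in this pass
            break
    return weights, worst

# Toy data: (descriptor vector, empirical estimate of the property).
training = [([1.0, 0.0], 0.2), ([0.0, 1.0], 0.8), ([1.0, 1.0], 1.0)]
weights, worst = refine([0.0, 0.0], training)
```

Here the "predetermined value" is the `tolerance` argument; any other stopping rule could be substituted without changing the structure of the loop.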

In one embodiment, the predetermined property is the ability to bind to at least one
predetermined molecule.

The model may comprise at least one of a neural network, a factor analysis, or a 
principal components analysis. 
The neural network may comprise a plurality of layers each having at least one node,
wherein the plurality of layers includes a first layer having at least one node coupled to an
input value and a second layer having at least one node coupled to a plurality of nodes of
said first layer, the first layer having at least one node with a first transfer function and the
second layer having at least one node with a second transfer function.
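
A two-layer network of the kind just described might look like the sketch below. The choice of tanh and a logistic sigmoid as the two transfer functions, the three-element descriptor, and every weight value are assumptions made for illustration; the text does not specify them.

```python
import math

def layer(inputs, weights, biases, transfer):
    """One layer: each node applies `transfer` to a weighted sum of its inputs."""
    return [transfer(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_binding(descriptor, params):
    """Forward pass: first layer reads the input values, second layer reads the first."""
    hidden = layer(descriptor, params["w1"], params["b1"], math.tanh)  # first transfer function
    output = layer(hidden, params["w2"], params["b2"], sigmoid)        # second transfer function
    return output[0]

# Hypothetical parameters: 3 input values, 2 first-layer nodes, 1 second-layer node.
params = {
    "w1": [[0.5, -0.3, 0.1], [-0.2, 0.4, 0.7]],
    "b1": [0.0, 0.1],
    "w2": [[1.2, -0.8]],
    "b2": [0.05],
}
score = predict_binding([1.0, 0.5, -1.0], params)
```

The second-layer node is coupled to both first-layer nodes, matching the "coupled to a plurality of nodes of said first layer" wording; the sigmoid keeps the output between 0 and 1 so it can be read as a binding score.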

The conformation may be determined by at least one of x-ray
crystallography, nuclear magnetic resonance, or ab initio molecular modeling.

The conformation may comprise at least one of an absolute position of atomic
nuclei in each molecule, a relative position of atomic nuclei in each molecule, an electron
density distribution, a bond angle, a bond length, or a van der Waals radius of atoms in the
molecule. A conformational database may be searched for molecules having a
conformation similar to the new conformation. At least a portion of the new molecule may
be synthesized.

The new molecule preferably comprises at least one of DNA, RNA, a peptide, a
polypeptide, or a small molecule. In another embodiment, one or more first variants of the
new molecule that are at least somewhat similar to the new molecule are produced and
one or more of the first variants having at least one desired characteristic are selected.
The first variants preferably comprise a stochastic sequence of polynucleotides.
In another embodiment, antibodies are raised against the new molecule.

Another embodiment of the present invention comprises the steps of:

obtaining an initial odd set of molecules that bind at least one molecule
belonging to an origin set of molecules;

obtaining an even set of molecules that bind at least one molecule belonging
to said odd set of molecules;

obtaining an odd set of molecules that bind at least one molecule belonging
to said even set of molecules;

repeating said obtaining an odd set of molecules and said obtaining an even
set of molecules steps to generate a sequence of odd and even sets of molecules wherein the
molecules in each of said sets bind to at least one of the molecules in a preceding one of the
sets in the sequence;

selecting a training set comprising an even subset from each of at least two
even sets of molecules;

determining a conformation for each molecule in each of said subsets;

constructing a model for predicting a predetermined property of at least one
new molecule not assigned to the subsets of molecules wherein the model comprises the
conformation of at least some of the molecules from each even subset; and

predicting a predetermined property of the new molecule.

In a preferred embodiment, the method further comprises selecting a training set
comprising a subset from each of at least two odd sets of molecules and repeating the
determining, constructing, and predicting steps wherein the model further comprises the
conformation for each molecule in each of the odd subsets.
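
The alternating sequence of odd and even sets can be illustrated with a toy model. In this sketch, which is an assumption-laden illustration rather than the screening procedure itself, a "shape" is a single integer, the hypothetical `complementary` rule says two molecules bind when their shapes nearly cancel, and the library is a fixed range of integers.

```python
def complementary(m, p):
    """Hypothetical binding rule: shapes bind when they nearly cancel."""
    return abs(m + p) <= 1

def build_sequence(origin, library, binds, rounds):
    """Return [odd_1, even_1, odd_2, ...]; each set holds the library members
    that bind at least one member of the preceding set."""
    sets = []
    previous = set(origin)
    for _ in range(rounds):
        current = {m for m in library if any(binds(m, p) for p in previous)}
        sets.append(current)
        previous = current
    return sets

sets = build_sequence({5}, range(-8, 9), complementary, rounds=4)
odd_sets, even_sets = sets[0::2], sets[1::2]
# odd_sets[0]  == {-6, -5, -4}     shape complements of the origin
# even_sets[0] == {3, 4, 5, 6, 7}  shapes resembling the origin, slightly broadened
```

In this toy model each even set contains the origin shape together with progressively more distant neighbors, which is the sense in which the even sets provide training molecules spread around the target shape.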
In a more preferred embodiment, the method further comprises the steps of:

predicting a predetermined property of at least one molecule assigned to one
of the subsets;

conditionally modifying the model in response to a difference between said
predicted predetermined property and an empirical estimate of said predicted property; and

repeating said predicting and conditionally modifying steps until said
difference reaches a predetermined value.

The predetermined property is preferably the ability to bind to at least one
predetermined molecule.

The model preferably comprises at least one of a neural network, a factor analysis
model, a principal components analysis model, or an independent component analysis
model.

Another embodiment of the invention comprises the steps of:

selecting a first origin set of molecules;

obtaining an initial odd set of molecules that bind at least one molecule
belonging to the first origin set of molecules;

obtaining an even set of molecules that bind at least one molecule belonging
to said odd set of molecules;

obtaining an odd set of molecules that bind at least one molecule belonging
to said even set of molecules;

repeating said obtaining an odd set of molecules and said obtaining an even
set of molecules steps to generate a sequence of odd and even sets of molecules wherein the
molecules in each of said sets bind to at least one of the molecules in a preceding one of the
sets in the sequence;

selecting a second origin set of molecules and repeating said obtaining an
initial odd set, obtaining an even set, obtaining an odd set and said repeating steps to
generate a second sequence of odd and even sets of molecules;

selecting a training set comprising an even subset from each of at least two
even sets of molecules belonging to the first and second sequences;

determining a conformation for each molecule in each of said subsets;

constructing a model for predicting a predetermined property of at least one
new molecule not assigned to the subsets of molecules wherein the model comprises the
conformation of at least some of the molecules from each even subset; and

predicting a predetermined property of the new molecule.
Preferably, the predetermined property comprises the ability of the new molecule to
bind to each of at least two predetermined molecules.

4. BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a flowchart representing an example of a process for obtaining a targetshape
group.

FIG. 2 discloses a representative computer system in conjunction with which the 
embodiments of the present invention may be implemented. 

5. DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

The present invention has as its object a means to obtain one or more initial
candidate molecules, e.g., lead molecules in a drug discovery process, that are at least
somewhat dissimilar to a chosen target molecule or "targetshape", to produce initial
variants of the initial candidate molecules, and to screen or select candidates from among
those variants for molecules that are either more or less similar to the targetshape. The
process of producing variants and screening or selecting from among the variants for
molecules may be repeated at least once. Hence, the present invention provides a means to
carry out an adaptive walk in molecular shape space to climb towards or away from close
mimics of a given targetshape.
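
The adaptive walk can be pictured with a small simulation. This sketch rests on strong simplifying assumptions that are not part of the text: molecules are short RNA-like strings, variants differ from their parent at one position, and similarity to the targetshape is scored by counting matching positions, standing in for whatever binding or displacement assay would be used in practice.

```python
import random

ALPHABET = "ACGU"

def similarity(candidate, target):
    """Count matching positions: a stand-in for a measured shape similarity."""
    return sum(a == b for a, b in zip(candidate, target))

def variants(parent, count, rng):
    """Produce single-position variants of the parent sequence."""
    out = []
    for _ in range(count):
        i = rng.randrange(len(parent))
        out.append(parent[:i] + rng.choice(ALPHABET) + parent[i + 1:])
    return out

def adaptive_walk(start, target, rounds=50, library_size=20, toward=True, seed=0):
    """Repeat 'produce variants, then select', climbing toward or away from the target."""
    rng = random.Random(seed)
    sign = 1 if toward else -1
    current = start
    for _ in range(rounds):
        pool = variants(current, library_size, rng) + [current]  # parent stays in the pool
        current = max(pool, key=lambda c: sign * similarity(c, target))
    return current

target = "GGAUCCAUGG"
closer = adaptive_walk("AAAAAAAAAA", target)
farther = adaptive_walk(target, target, toward=False)
```

Because the parent is kept in each selection pool, the score never moves in the wrong direction from one round to the next; setting `toward=False` turns the same loop into a walk away from the targetshape.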
A further object of the present invention is to provide a means of generating
diversity libraries of candidate molecules that are "focused" into a selected region of shape
space. For example, without limitation, consider the problem of finding a molecule able to
bind the estrogen receptor and also able to bind some other receptor, X. Initial candidates
might bind so weakly to both receptors as to be undetectable. The present invention
provides a means of identifying initial candidate molecules that are only modestly "close" to
being able to bind both the estrogen and X receptors. These initial candidates likely occupy
the same region of molecular shape space that is or would be occupied by improved
candidates that are better able to bind both receptors. Improved candidates can be sought by
generating a new library of candidate molecules focused in the same general vicinity of
shape space as the initial candidates and selecting or screening for improved candidates
better able to bind both receptors simultaneously. An iterative succession of such steps,
obtaining and testing new libraries directly or in part computationally, constitutes a lead
optimization procedure with respect to these two tasks.

A variety of characteristics may be used to select molecules according to the
invention. According to one mode of carrying out the process according to the invention,
the property serving as the criterion of selection is that of having at least one epitope
similar to one of the epitopes of a given antigen or other molecule. According to another
mode of the invention the criterion for selection may be the capacity of a molecule to bind a
given antigen, other molecule, or surface. According to yet another mode, the criterion for
selection may be the capacity of a molecule to displace a member of two or more bound
molecules.

The property serving as the criterion for selection can be the capacity of the
molecule to catalyze a given chemical reaction. For instance, for the production of several
peptides and/or polypeptides, the said property can be the capacity to catalyze a sequence of
reactions leading from an initial group of chemical compounds to at least one target
compound.

The said property can also be the capacity to modify selectively the biological or 
chemical properties of a given compound, for example, the capacity to selectively modify 
the catalytic activity of a polypeptide or other molecular catalyst. 

The said property can also be the capacity to stimulate, inhibit, or otherwise modify
at least one biological function of at least one biologically active compound, chosen, for
example, among the hormones, neurotransmitters, adhesion factors, growth factors, and
specific regulators of DNA replication and/or transcription and/or translation of RNA.

The invention also has as its object the use of the molecule obtained by the
processes of the invention, for the measurement, e.g., qualitatively, quantitatively, or both,
of an analyte or other target molecule.

According to a particularly advantageous mode of carrying out the invention, the
desired characteristic of the molecule is the capacity to simulate or modify the effects of a
biologically active molecule, for example, a protein, and screening and/or selection for
clones of transformed host cells producing at least one peptide or polypeptide having this
property is carried out by preparing antibodies against the active molecule, then utilizing
these antibodies, after their purification, to identify the clones containing this peptide or
polypeptide, then by cultivating the clones thus identified, separating and purifying the
peptide or polypeptide produced by these clones, and finally by submitting the peptide or
polypeptide to an in vitro assay to verify that it has the capacity to simulate or modify the
effects of the said molecule.

It is known in the art that, as a non-limiting example, if one screens a phage display
or RNA aptamer library for sequences that bind to a receptor, a number of different
sequences will do so with a distribution of affinities. If one takes the set of sequences
binding the receptor above a chosen threshold of affinity, and examines the sequences, it is
known in the art that this set can often be organized into one or more families, each of
which has a consensus sequence. Thus, the consensus sequence itself, plus some family of
related sequences, are candidates to bind the receptor. Often, but not always, the consensus
sequence itself, if constructed, will bind the receptor.
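
As a small worked illustration of the thresholding and consensus step, the sketch below filters hypothetical screening hits by affinity and takes the most common residue at each position. The sequences, the affinity values, and the threshold are invented for the example, and the sequences are assumed to be pre-aligned, of equal length, and members of a single family.

```python
from collections import Counter

def binders_above(hits, threshold):
    """Keep sequences whose measured affinity meets the chosen threshold."""
    return [seq for seq, affinity in hits if affinity >= threshold]

def consensus(sequences):
    """Most common residue at each position of equal-length, aligned sequences."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*sequences))

# Hypothetical screening results: (sequence, relative affinity).
hits = [("ACGT", 0.9), ("ACGA", 0.8), ("TCGT", 0.7), ("GGGG", 0.1)]
family = binders_above(hits, threshold=0.5)
family_consensus = consensus(family)  # "ACGT"
```

Note that the consensus need not be one of the observed sequences, which is why the text cautions that the consensus sequence will often, but not always, bind the receptor once constructed.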

The invention carries over to obtaining polypeptides by the process specified above 
and utilizable as chemotherapeutically active substances. 

5.1 Molecular Diversity Libraries

5.1.1 Recombinant Techniques

A variety of means are available for the generation of molecular diversity libraries.
For example, and not by way of limitation, DNA, RNA, peptides,
polypeptides, or proteins may be obtained through the use of transformed host cells containing genes capable
of expressing these RNAs, peptides, polypeptides, or proteins, i.e., by recombinant DNA
techniques as described in U.S. Patent No. 5,824,514 to S. Kauffman et al., U.S. Patent No.
5,763,192 to S. Kauffman et al., U.S. Patent No. 5,723,323 to S. Kauffman et al., M. Pavia
et al., Bioorg. & Med. Chem. Ltrs., 1993, 3:387, and J. Devlin et al., Science, 1990, 249:404,
which are hereby incorporated by reference in their entireties. Using such techniques, a
library of expression vectors containing stochastically generated polynucleotide sequences
is formed. Host cells containing the vectors are cultured so as to produce peptides,
polypeptides, or proteins encoded by the stochastically generated polynucleotide sequences.
Screening or selection is carried out on such host cells to identify a peptide, polypeptide, or
protein produced by the host cells which has a predetermined property. The stochastically
generated polynucleotide sequence which encodes the identified peptide, polypeptide, or
protein is then isolated and used to produce the peptide, polypeptide, or protein having the
predetermined property.

5.1.2 Random Chemistry

Another approach to generating a diversity of compounds is described in Patent No.
WO9424314 to Kauffman and Rebek, incorporated herein by reference in its entirety, which
discloses the generation of new compounds using random chemistry, with or without
enzymes, and the subsequent characterization or identification of compounds with a desired
property.

In one random chemistry approach, a starting group of different organic molecules is
provided. At least one chemical reaction is caused to take place with at least some of the
different organic molecules in the starting group to create an intermediate reaction mixture
having one or more organic molecules different from the organic molecules in the starting
group. The step of causing at least one chemical reaction to take place is repeated at least
once. Each subsequent repetition uses the reaction mixture of the previous step, and the last
repetition produces a final reaction mixture. The final reaction mixture
is screened for the presence of an organic molecule having a desired property.

In another approach to random chemistry, a diversity of compounds is generated
from a group of substrates which are subjected to a group of enzymes representing a
diversity of catalytic activities. As used herein, the term "enzyme" includes enzymes (e.g.,
naturally or non-naturally occurring or produced), catalysts (e.g., catalytic surfaces),
candidate catalysts and candidate enzymes (e.g., antibodies, RNA, DNA or random
peptides/polypeptides). The substrates may have different or similar core structures, and
similar or different functional groups as substituents. Alternatively, the substrates may
have different or similar core structures and different or similar functional groups as
substituents. The substrates may have similar or identical core structures, but a variety of
different functional groups as substituents, permitting the creation of a diversity of
compounds centered around a particular compound or a particular class of compounds.

For example, one may react a group of different enzymes representing a diversity of
catalytic activities under suitable conditions with a group of different substrates, thereby
producing one or more organic molecules different from the enzymes and substrates in the
reaction mixture; screen the reaction mixture for the presence of an organic molecule having
a desired property; and isolate from the reaction mixture the organic molecule having the
desired property. In another approach, one may react a group of different enzymes
representing a diversity of catalytic activities under suitable conditions with a group of
different substrates, thereby producing one or more organic molecules different from the
enzymes and substrates in the reaction mixture; screen the reaction mixture for the presence
of an organic molecule having a desired property; and determine the structure or functional
properties characterizing the organic molecule having the desired property.
Using a random chemistry approach, at least two ways are provided for generating a
diversity of molecules. One does not use enzymes, but uses a variety of possible
adducts or other molecules which may undergo reactions with the initial molecule of
interest, and also uses a variety of chemical reagents and physical conditions to drive the
synthesis of a library of derivatized products of the initial molecule. Alternatively, the core
initial molecule plus a set of candidate adducts and other molecules which may react with
the initial molecule are used, but also included is a set of enzymes which may increase the
rate of formation of the local high diversity library of derivatized forms of the initial
compound. It will be readily appreciated by those of ordinary skill in the art that the
methods for producing general high diversity libraries of product molecules and for
producing local high diversity libraries of derivatized forms of an initial compound may be
combined. For example, a new initial compound may be generated by the general
procedure (e.g., substrates with different core structures). Such a new compound is then
used, with or without derivatives, to generate a local high diversity library of derivatized
forms of the compound. Further, it will be evident to those of ordinary skill in the art that
libraries may be generated using a combination of random chemistry methods without
enzymes and with enzymes.

5.1.3 Production of Antibodies

Described herein are methods for the production of antibodies capable of
specifically recognizing one or more target epitopes or molecules. Such antibodies may
include, but are not limited to, polyclonal antibodies, monoclonal antibodies (mAbs),
humanized or chimeric antibodies, single chain antibodies, Fab fragments, F(ab')2
fragments, fragments produced by a Fab expression library, anti-idiotypic (anti-Id)
antibodies, and epitope-binding fragments of any of the above. Such antibodies are useful
as shape complements to one or more target molecules as part of a molecular diversity
library according to the invention.
For the production of antibodies to a target epitope or molecule, various host
animals may be immunized by injection with the target molecule or a portion thereof. Such
host animals may include but are not limited to rabbits, mice, and rats, to name but a few.
Various adjuvants may be used to increase the immunological response, depending on the
host species, including but not limited to Freund's (complete and incomplete), mineral gels
such as aluminum hydroxide, surface active substances such as lysolecithin, pluronic
polyols, polyanions, peptides, oil emulsions, keyhole limpet hemocyanin, dinitrophenol, and
potentially useful human adjuvants such as BCG (bacille Calmette-Guerin) and
Corynebacterium parvum.

Polyclonal antibodies are heterogeneous populations of antibody molecules
derived from the sera of animals immunized with an antigen, such as a target molecule, or an
antigenic functional derivative thereof. For the production of polyclonal antibodies, host
animals such as those described above may be immunized by injection with the target
molecule supplemented with adjuvants as also described above.

Monoclonal antibodies, which are homogeneous populations of antibodies to
a particular antigen, may be obtained by any technique which provides for the production of
antibody molecules by continuous cell lines in culture. These include, but are not limited to,
the hybridoma technique of Kohler and Milstein (1975, Nature 256:495-497; and U.S.
Patent No. 4,376,110), the human B-cell hybridoma technique (Kosbor et al., 1983,
Immunology Today 4:72; Cole et al., 1983, Proc. Natl. Acad. Sci. USA 80:2026-2030), and
the EBV-hybridoma technique (Cole et al., 1985, Monoclonal Antibodies And Cancer
Therapy, Alan R. Liss, Inc., pp. 77-96). Such antibodies may be of any immunoglobulin
class including IgG, IgM, IgE, IgA, IgD and any subclass thereof. The hybridoma
producing the mAb of this invention may be cultivated in vitro or in vivo. Production of
high titers of mAbs in vivo makes this the presently preferred method of production.

In addition, techniques developed for the production of "chimeric antibodies"
(Morrison et al., 1984, Proc. Natl. Acad. Sci., 81:6851-6855; Neuberger et al., 1984, Nature,
312:604-608; Takeda et al., 1985, Nature, 314:452-454) by splicing the genes from a mouse
antibody molecule of appropriate antigen specificity together with genes from a human
antibody molecule of appropriate biological activity can be used. A chimeric antibody is a
molecule in which different portions are derived from different animal species, such as
those having a variable region derived from a murine mAb and a human immunoglobulin
constant region.

Alternatively, techniques described for the production of single chain
antibodies (U.S. Patent 4,946,778; Bird, 1988, Science 242:423-426; Huston et al., 1988,
Proc. Natl. Acad. Sci. USA 85:5879-5883; and Ward et al., 1989, Nature 334:544-546) can
be adapted to produce antibodies to one or more target molecules. Single chain antibodies
are formed by linking the heavy and light chain fragments of the Fv region via an amino
acid bridge, resulting in a single chain polypeptide.

Antibody fragments which recognize specific epitopes may be generated by 
known techniques. For example, such fragments include but are not limited to: the F(ab')2
fragments which can be produced by pepsin digestion of the antibody molecule and the Fab
fragments which can be generated by reducing the disulfide bridges of the F(ab')2 fragments.
Alternatively, Fab expression libraries may be constructed (Huse et al., 1989, Science,
246:1275-1281) to allow rapid and easy identification of monoclonal Fab fragments with 
the desired specificity. 

5.2 Detection of Molecules and Molecular Binding

A variety of means are available which allow characterization, e.g., measurement
quantitatively, qualitatively, or both, of low concentrations of one or more species of a
desired molecule in a mixture of molecules generated by the methods provided herein. A 
variety of means are also available which allow characterization of binding or affinity 
between molecules. 

In general, the methods of the invention comprise ascertaining the presence of a
molecule having a desired property and/or measuring the abundance of a molecule having a
desired property in a set or mixture of molecules generated by the methods provided herein.
A variety of cell systems are well known to those of ordinary skill in the art which
allow measurement of low concentrations of ligands, e.g., ligands binding a hormone
receptor. In this regard, for example, a system has been developed which clones human G 
peptide hormone receptors into frog melanocytes (Lerner, Proc. Natl. Acad. Sci. USA). 

The hormone receptors, typically located in the cell membrane, respond to binding
of the corresponding hormone by triggering a cell response releasing or reabsorbing
melanophores. In a forty minute reversible cycle, cells darken dramatically, then can be
induced to lighten in color again. Response of the cell depends upon the affinity of the
hormone for the receptor. Typical responses occur in the nanomolar to 100 picomolar
hormone concentration range. For some hormone receptor-hormone pairs where affinity is
higher, response occurs in the picomolar hormone concentration range. This cell system is
an example of an assay system which allows measurement, in a mixture of molecules, of
one or more species of ligands able to bind to the receptor. The ligands able
to bind the receptor are then the ligands of interest, for they are candidates to act as drugs by
antagonizing, agonizing, substituting for, or modifying the effects of the natural hormone.
Alternatively, according to the methods of the invention, the ligands of interest may be
those not binding the receptor.

A second example of a cell assay is that available commercially from Molecular 
Devices (Palo Alto, CA). It consists of an array of chemfets which respond to very small 
changes in local pH. In turn, these small pH changes reflect the altered metabolic activity of 
a population of cells upon receipt of some molecular signal, such as a hormone binding its 
receptor. For example, cell assays in which a hormone binds a receptor are known to those
of ordinary skill in the art and allow nanomolar or subnanomolar concentrations of the
hormone ligand to be measured. A preferred means of using the present invention consists
in exposing such cells to a high diversity library of molecules or target shape set of
molecules generated by the methods provided herein to ascertain the presence of or
measure the abundance of one or more species of molecules able to trigger the cell response.
The molecules in that set, each of which is highly likely to bind the hormone receptor, are the
molecules of interest.

Another example is to use blast B cells, which on their surface express antibodies
directed to a molecule of interest, to detect in a high diversity library the presence of
molecules which sufficiently mimic the molecule of interest to be able to bind to its
antibody on a B cell. Thus, an animal is immunized with a molecule of interest and the early
B cells are isolated. A high diversity library of molecules generated by the methods provided
herein is screened using the population of B cells. For example, binding may stimulate cell
cycling or division by the blast B cell bound. Cell cycling or division may be detected by
means known in the art.

Alternatively, a variety of assays to detect the presence of a ligand of interest exist 
which are based on direct binding assays. Thus, for example, a receptor for a hormone can 
be used directly to detect binding of a radioactively labeled ligand. Other means, known in
the art, to accomplish this include the following:

(i) The estrogen receptor is used as a non-limiting example. The cloned receptor can
be affixed to a flat surface, for example, a filter. Very high specific activity estrogen is
prepared, and bound to the receptor population. This set of bound receptors is then used in
a competitive assay. The bound receptors are exposed to a library of compounds generated
by the methods of the present invention. If the library contains ligands which also bind the
estrogen receptor, those ligands will compete with the radioactively labeled estrogen itself
for the receptors. Hence the radioactively labeled estrogen will be competitively displaced
from the receptors and can readily be detected by means known in the art. Thus, this assay
allows detection of one or more species of ligands in the mixture which compete with
estrogen for the estrogen receptor. This set of ligands is the set of interest, as they
are candidates to be drugs mimicking or antagonizing estrogen.

(ii) The estrogen receptor is again used as a nonlimiting example. By means known 
in the art, one raises antibody molecules which are able to bind the receptor when the 
receptor is not bound by estrogen, but not bind the receptor when occupied by estrogen. 
Alternatively, one generates antibody molecules which bind the estrogen receptor only when 
the receptor itself does bind estrogen. These antibody molecules can then be decorated with
reporter groups by a variety of means known in the art, and used to detect the presence of 
one or more ligand species in a library of high diversity, which bind to the estrogen 
receptor. In the case of antibodies which only bind the receptor if the receptor is itself 
unbound by estrogens, one tests for loss of antibody binding in the presence of the library of 
compounds and in the simultaneous absence of estrogen. In the case of antibodies which
bind the receptor only if the receptor is bound by estrogen, one tests for an increase in 
binding of the antibody in the presence of the receptor and high diversity library. 

(iii) In order to detect ligands in a high diversity library which are candidates to
mimic or antagonize the action of a given hormone or other molecule of interest, it is
advantageous to generate one or more monoclonal antibodies which bind the hormone or
other molecule of interest. This set of monoclonal antibodies can then be used, rather than a
receptor, for the target molecule that is to be mimicked, in binding assays such as those 
noted above to detect the presence of one or more ligand species in the reaction mixture 
which are candidates to mimic or antagonize the action of the target molecule. An 
advantage of this procedure is that a receptor for the target molecule need not be available.
Use of a set of monoclonal antibodies is advantageous because, a priori, it is not certain 
which molecular feature, or epitope, of the target molecule mediates its biological action. 
Use of a set of monoclonal antibodies, each responding to a different epitope on the target 
molecule, enhances the probability that the ligands detected in the high diversity library will 
include those which mimic the biologically important epitope of the target. In some cases it
may be possible to selectively use only those monoclonal antibody molecules which bind to 
the known important epitope of the target molecule. 

(iv) Means are established in the art to measure protein-protein binding based on 
plasmon resonance and detection of a shift in refractive index. In a detection system 
developed by Pharmacia (Piscataway, NJ), a monoclonal antibody or a hormone receptor is
layered onto a gold chip. Binding of hormone, or other ligands to a receptor, is measured in 
very low concentrations (e.g., in the nanogram range or less). Thus, any receptor, or 
antibody, or other "shape complement" of a target molecule of interest can be placed on the 
gold chip, the latter can be exposed to a high diversity library, and the presence of binding 
species can be measured quantitatively, qualitatively, or both.
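
The competitive displacement logic of assay (i) can be illustrated numerically. The following is a minimal sketch, assuming a simple one-site equilibrium in which free ligand concentrations are not depleted by binding. The function name and all concentrations and dissociation constants are hypothetical values chosen for illustration; none are taken from this disclosure.

```python
def labeled_fraction_bound(labeled_conc, kd_labeled, competitor_conc, ki_competitor):
    """Fraction of receptor sites occupied by the labeled ligand.

    Assumes one binding site per receptor and concentrations in mol/L.
    A competitor raises the apparent Kd of the labeled ligand by the
    factor (1 + [competitor] / Ki).
    """
    apparent_kd = kd_labeled * (1.0 + competitor_conc / ki_competitor)
    return labeled_conc / (labeled_conc + apparent_kd)


# Hypothetical example: 1 nM labeled estrogen with Kd = 1 nM,
# challenged by 100 nM of a library ligand with Ki = 10 nM.
baseline = labeled_fraction_bound(1e-9, 1e-9, 0.0, 1e-8)
with_competitor = labeled_fraction_bound(1e-9, 1e-9, 1e-7, 1e-8)

# Share of the originally bound label that is displaced by the competitor.
displaced = 1.0 - with_competitor / baseline
print(f"bound without competitor: {baseline:.3f}")
print(f"bound with competitor:    {with_competitor:.3f}")
print(f"fraction displaced:       {displaced:.3f}")
```

A drop in bound label of this kind is the signal that the library contains at least one species competing for the receptor.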

Another example of direct measurement of ligand binding, which the applicant
believes was developed by Evotech, can measure ligand binding in the femtomolar range.

A variety of approaches for characterizing molecular binding are based on 
fluorescence correlation spectroscopy. For example, Rudolph Rigler (1995, J.
Biotechnology, 41:177) has reviewed fluorescence correlation approaches to measuring
molecules and binding of molecules. In one approach, a laser beam is focused to a radius of
less than about 1 micron. Fluorescent molecules or molecules labeled with fluorescent tags
can be measured at femtomolar concentrations (10^-15 M) in tens of seconds. Binding of a
fluorescent molecule or a molecule labeled with a fluorophore to another molecule can be
characterized, e.g., measured qualitatively, quantitatively, or both, because of the reduced
diffusion coefficient of the bound molecules compared to the unbound molecules. Similar
approaches based on the different electrophoretic mobilities of bound and unbound
molecules are known in the art. Competitive assays in which a molecule displaces a
member of at least two bound molecules can be used to assess the relative binding
efficiency or affinity of a set of molecules for one or more other molecules.

Thus, for example, if estrogen is a target molecule, and a small RNA aptamer is a
shape complement which binds estrogen, then fluorescently labeled versions of that RNA
aptamer can be used in a fluorescence correlation approach. An estrogen-mimic which
binds the fluorescently labeled RNA will slow its diffusion as detected in the laser system.
Thus estrogen-mimics at very low, 10^-15 M or femtomolar, concentrations can be detected.
Alternatively, one may begin with a number of complexes comprising estrogen and the
fluorescently labeled RNA aptamer. Adding one or more molecules that compete with the
RNA aptamer for binding sites on estrogen can be detected by the appearance of unbound
labeled RNA aptamer. One of ordinary skill in the art recognizes that a number of
approaches for detecting binding and binding characteristics using fluorescence based
methods are possible.
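
The dependence of the fluorescence correlation signal on binding can be illustrated with a rough estimate. The sketch below is not part of the disclosed method: it assumes spherical particles obeying the Stokes-Einstein relation in water, and the hydrodynamic radii (1 nm free, 3 nm bound) and the 0.5 micron focus radius are hypothetical values chosen only to show the direction and size of the effect.

```python
import math

BOLTZMANN = 1.380649e-23  # J/K


def diffusion_coefficient(radius_m, temperature_k=293.0, viscosity_pa_s=1.0e-3):
    """Stokes-Einstein estimate, D = kT / (6 * pi * eta * r), in m^2/s."""
    return BOLTZMANN * temperature_k / (6.0 * math.pi * viscosity_pa_s * radius_m)


def diffusion_time(beam_radius_m, diffusion_coeff):
    """Characteristic transit time through a focus of radius w: tau = w^2 / (4 D)."""
    return beam_radius_m ** 2 / (4.0 * diffusion_coeff)


beam_radius = 0.5e-6  # assumed focus radius, below the "about 1 micron" bound

# Assumed hydrodynamic radii for the free label and the bound complex.
tau_free = diffusion_time(beam_radius, diffusion_coefficient(1.0e-9))
tau_bound = diffusion_time(beam_radius, diffusion_coefficient(3.0e-9))

slowdown = tau_bound / tau_free
print(f"free label transit time:  {tau_free:.2e} s")
print(f"bound complex transit:    {tau_bound:.2e} s")
print(f"slowdown factor:          {slowdown:.2f}")
```

Under these assumptions the transit time scales linearly with hydrodynamic radius, so a complex three times larger dwells three times longer in the focus.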

A further means to detect ligands of interest at very low concentrations consists in 
seeking ligands which block a DNA polymerase. By blocking the DNA polymerase chain 
reaction (PCR) enzyme, amplification of the DNA can be blocked. Since PCR 
amplification can yield billions or more copies of the initial DNA sequence, blocking PCR
amplification yields a readily detectable signal of a ligand which blocks the polymerase.
Clearly, this method generalizes to other means to amplify DNA, RNA, or DNA- or
RNA-like molecules, such as ligation amplification, and extends to general means to block
polymerases directly or indirectly with ligands of interest.
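
The size of the signal lost when amplification is blocked follows from simple arithmetic. The sketch below assumes an idealized reaction in which each cycle multiplies the template by (1 + efficiency); the function name, the cycle count of 30, and the efficiency values are illustrative assumptions.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Idealized copy number after a number of PCR cycles.

    efficiency = 1.0 means perfect doubling each cycle;
    efficiency = 0.0 means the polymerase is fully blocked.
    """
    return initial_copies * (1.0 + efficiency) ** cycles


unblocked = pcr_copies(1, 30)                # perfect doubling for 30 cycles
blocked = pcr_copies(1, 30, efficiency=0.0)  # polymerase fully inhibited

print(f"unblocked: {unblocked:.3e} copies")
print(f"blocked:   {blocked:.0f} copy")
```

Thirty ideal cycles give 2^30 copies, slightly more than one billion, from a single template, so a ligand that blocks the polymerase produces a difference of about nine orders of magnitude.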

As described herein, compounds of interest may act as catalysts for a desired
reaction, or as cofactors with other molecules to form an active catalyst. Other molecules 
may act as inhibitors of enzymes. In order to exclude the possibility that the enzymes or 
catalysts are found among the candidate set of enzymes which may have been used to 
generate the compounds of interest, the latter set of enzymes can be quantitatively removed 
from the high diversity library by, for example, affinity columns bearing molecules directed
to a constant part of each of the set of enzymes, or other means known in the art. The 
resulting high diversity library itself is then assayed for candidates of interest. 

Detection of molecules able to inhibit an enzyme may proceed by detecting ligands 
able to bind the enzyme, as described above. Identifying molecules which are candidates to 
catalyze a reaction alone or as a cofactor may proceed by testing high diversity libraries of
the invention alone, or in the presence of a helper molecule, say a protein, for which a
desired molecule will be a cofactor. The system is tested for the presence of ligands able to
bind a stable analogue of the transition state of the reaction. Such binding molecules are the
candidate catalysts or cofactors sought, for they are candidates to catalyze the reaction itself.

Alternatively, a variety of means are known in the art which allow detection of the
products of a catalyzed reaction itself. For example, chromogenic or fluorogenic substrates
for a variety of reactions of interest are available. Catalysis of the reaction increases the rate
of formation of the colored or fluorescent product. Alternatively, assay systems are
available or readily prepared which detect the presence of a product molecule because that
product molecule binds a receptor, an antibody molecule, or other shape complement. Thus,
detection of higher rates of formation of that product molecule demonstrates that the 
reaction itself was catalyzed. 

5.3 Characterization of Molecular Libraries

Following the generation of high diversity libraries of compounds and the screening 
for the presence of compounds having properties of interest, compounds of interest may be 
characterized with or without isolation. A variety of means, including those known in the 
art, are available to characterize or isolate such compounds of interest. 

Characterization and/or isolation depend upon the information desired and can be
carried out at different mole abundances of the target molecule of interest. For example,
using modern mass spectrographic analysis, about 10^-15 to 10^-18 moles can be assayed for
mass and charge, then fragmented in a variety of ways known in the art and the fragments
assayed for mass and charge. Using such data, it is possible to derive the structure of the
molecule of interest. For example, ligands of interest may be isolated by binding to a given
hormone receptor, or monoclonal antibody, then the liganding molecules released by means 
known in the art and finally characterized analytically. One means comprises attaching a 
target receptor or antibody to a solid support. A reaction mixture or subset thereof is 
contacted with the solid support. Those molecules that are bound will be retained, while the 
non-bound molecules are readily separated from the solid support. The molecules of
unknown structure which have been retained, are then eluted. The freed molecules are 
characterized analytically, e.g., by mass spectroscopy, NMR, IR, UV, and may be 
synthesized in batch quantities. 

Kibbey et al., U.S. Patent No. 5,670,054, disclose an automated method of sample
identification, purification and quantitation wherein a first HPLC column with defined
operating parameters is used to separate a small portion of an impure mixture into its 
constituent components; the individual components corresponding to the eluting zones of 
the separated mixture are characterized by mass spectrometry; the chromatographic and 
mass spectroscopic data generated are stored in digital format, for example one compatible 
with commercial chromatography software, and the data is used to guide the purification of
the remaining sample; the remaining sample is injected on a semi-preparative, or
preparative HPLC column; an analog detector output of the semi-preparative, or preparative 
HPLC system is digitized and evaluated electronically with the previously generated 
chromatographic and mass spectroscopic data; when elution of a sample component peak
corresponding to a desired product peak is sensed, a mechanically actuated, liquid switching
valve (i.e., a pneumatic or electronic switching valve) is actuated to divert the column eluate
from waste to a fraction collection device; and when the end of product peak elution is 
sensed, the switching valve is actuated to divert the column eluate back to waste collection. 
The system enables rapid purification of samples in quantities useful for screening of
diversity libraries while involving minimal operator input and minimum fraction collection
equipment.

In other cases, the concentrations of molecules of interest in the high diversity 
library will allow detection of their presence, but may be too low for further isolation or 
characterization. A preferred procedure called "sib selection" allows ready winnowing of
the set of candidate enzymes, the set of founder substrates, and the set of reaction conditions
and chemical reagents to smaller sets. This winnowing simultaneously reduces the side 
products generated in the high diversity library, increases the concentration of the target 
molecule of interest, and identifies the subset of candidate enzymes which catalyze the 
pathway leading to synthesis of the target molecule, and identifies the set of founder
substrates required for synthesis of the desired target. Thus, this sib selection procedure is a
means to generate a previously unknown molecule of interest, as well as identify both that 
molecule and the substrates and enzymes needed to form that molecule. 
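
One possible winnowing loop in the spirit of sib selection is sketched below. It is an illustration only: the function `still_active` is a hypothetical stand-in for re-running the detection assay on a reduced pool of enzymes, substrates, or conditions, and the halving schedule is an assumption rather than a procedure stated in this disclosure. Components are discarded in progressively smaller blocks whenever the target is still detected without them, so that components which are required only in combination are retained together.

```python
def winnow(pool, still_active):
    """Shrink `pool` while the assay `still_active(subpool)` stays positive.

    Tries to drop blocks of components, starting with half the pool and
    halving the block size until single components have been tested.
    """
    pool = list(pool)
    chunk = max(1, len(pool) // 2)
    while chunk >= 1:
        start = 0
        while start < len(pool):
            trial = pool[:start] + pool[start + chunk:]
            if trial and still_active(trial):
                pool = trial      # the removed block was not needed
            else:
                start += chunk    # keep this block and move on
        chunk //= 2
    return pool


# Hypothetical example: 16 candidate enzymes, of which enzymes 3 and 11
# are both required for the target molecule to be produced.
required = {3, 11}
result = winnow(range(16), lambda subpool: required <= set(subpool))
print(result)
```

Each successful removal also removes the side products contributed by the discarded components, which corresponds to the enrichment effect described above.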

5.4 Targetshape Groups

A targetshape group of molecules, e.g., a molecular diversity library, according to the
present invention comprises a group of n sets of molecules s(i) (i = 0 ... n), wherein each set
s(i) contains at least 1 molecule. Set s0 generally contains one compound and represents the
center or "targetshape" of the group of n sets of compounds. The group of n sets of
molecules corresponding to set s0 and having set s0 at its center or origin is referred to as
targetshape s. Each set of molecules may also be referred to as a "ring" or "shell." A given
ring s(i) is said to have a higher order than ring s(i-1).

Initially, to obtain or generate members of targetshape set s1, a molecular diversity
library composed of DNA, RNA, peptides, polypeptides, small molecules, or other
compounds is generated or obtained and screened to obtain a set of molecules s1 able to
bind a predetermined targetshape or set s0. Alternatively, or in addition, members of s1
may be found by using members of targetshape set s0 to raise antibodies against s0. Given
that the members of set s1 bind members of s0, members of s1 generally have at least one
epitope or shape feature that is at least somewhat complementary to at least one epitope or
shape feature of members of s0.

Members of targetshape set s2 may be found by generating or obtaining and
screening a molecular diversity library for molecules able to bind s1, and/or members of s2
may be found by using members of targetshape set s1 to raise antibodies against s1.

Typically, for targetshape sets of order i ≥ 2, any given set s(i) includes a subset of
members, s(i)', each of which binds at least one member of set s(i-1) by way of substantially
the same epitope or molecular shape feature through which those members of set s(i-1) bind
at least one member of set s(i-2). Competitive binding assays may be used to identify members of each
subset. For example, members of subset s2' may be discriminated from the remainder of
set s2 because s2' and s0 will compete for the same binding site on one or more members of s1.
Given that members of s2' and s0 compete for the same binding site on at least one member
of s1, members of s2' generally have at least one molecular epitope or shape feature that is
similar to at least one molecular epitope or shape feature of s0. Thus, members of s2'
substantially correspond to mimics of s0.

Selecting members of s(i)' by competitive displacement of members of s(i) off s(i-1)'
using members of s(i-2)' is analogous to the concept of internal images in the immune
system in second rank antiidiotypes and generally corresponds to the search for shape
mimics using molecular diversity.

Figure 1 shows an example of a general process for generating or obtaining 
members of a targetshape group. Initially, an origin set comprised of a targetshape 
molecule is selected. In the next step, an intermediate set of molecules is generated or 
obtained. In general, molecules of the intermediate set of molecules bind molecules
belonging to the origin set. A terminal set of molecules is then generated or obtained. In 
general, molecules belonging to the terminal set bind to at least one molecule of the 
intermediate set. Next, a subset of the terminal set is selected. Generally, molecules 
belonging to the subset of the terminal set bind at least one member of the intermediate set 
by way of substantially the same epitope that the one member of the intermediate set binds
at least one member of the origin set. In an iterative process, the origin set is replaced with 
the intermediate set and the intermediate set is replaced with the subset of the terminal set. 
Finally, the steps of obtaining or generating a terminal set, selecting a subset of the terminal 
set, replacing the origin set, and replacing the intermediate set can be repeated until a 
plurality of sets of molecules are obtained.
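
The iterative process of Figure 1 can be sketched in code. The following is a toy model only, assuming that each molecule is reduced to a single number describing its shape, that two molecules bind when their shapes are nearly complementary (their sum is close to zero), and that such a one-dimensional molecule presents a single binding site. The function names `build_rings`, `binds`, and `shares_site` and the tolerance of 1 are hypothetical; real binding and competition would be determined by assay.

```python
def build_rings(target, library, binds, shares_site, n_rings):
    """Return [s0, s1, s2', s3', ...] using the origin/intermediate/terminal loop."""
    origin = [target]
    intermediate = [m for m in library if binds(m, target)]
    rings = [origin, intermediate]
    while len(rings) <= n_rings and intermediate:
        # Terminal set: everything that binds some member of the intermediate set.
        terminal = [m for m in library if any(binds(m, x) for x in intermediate)]
        # Subset: members that compete with an origin member for the same site.
        subset = [m for m in terminal
                  if any(binds(m, x) and shares_site(m, o, x)
                         for x in intermediate for o in origin)]
        rings.append(subset)
        origin, intermediate = intermediate, subset
    return rings


TOLERANCE = 1  # assumed: shapes within this distance of perfect complement bind


def binds(a, b):
    return abs(a + b) <= TOLERANCE


def shares_site(m, o, x):
    # In this toy model x has one site, so any molecule o that binds x competes with m.
    return binds(o, x)


rings = build_rings(10, list(range(-20, 21)), binds, shares_site, n_rings=4)
for order, ring in enumerate(rings):
    print(order, ring)
```

In this toy model the even rings cluster around the targetshape and the odd rings around its complement, and each successive ring is broader than the ring two orders below it.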

In general, members of any targetshape set s(i) can be generated or obtained by
screening a molecular diversity library composed of DNA, RNA, peptides, polypeptides,
small molecules, or other compounds for molecules able to bind at least one member of a
lower ordered targetshape subset s(i-1)'. Alternatively, or in addition, members of s(i) may be
found by using members of targetshape set s(i-1)' to raise antibodies against s(i-1)'.
Random chemistry approaches beginning with molecules having a core structure similar to
members of set s(i-2) may also be used to generate candidate molecules for set s(i).
Subsequently, members of a subset s(i)' that compete for the same binding site on s(i-1)' as
members of s(i-2)' may be discriminated from the remainder of set s(i). In this fashion,
members of targetshape sets s0, s1, s2, s2', ... s(i), s(i)', ... s(n), s(n)' may be obtained.

The complete targetshape group of n sets of molecules forms a gradient in shape-function
space surrounding the targetshape s0. The even rings, i.e., the even numbered sets of a targetshape
group, substantially correspond to shape mimics of s0 that are, on average, successively less
like s0 as ring order increases, whereas members of a given odd ring substantially
correspond to shape complements of molecules belonging to successive even rings. For
example, because members of subset s4' are identified by competitive binding with s2' for
sites on s3', then, since s2' members are similar to, but not identical to, the targetshape s0, it
follows that members of s4' are generally more similar to members of s2' than to s0.
Consequently, the sets of molecules s(i), where i is even, comprise a gradient in the shape
space surrounding s0 where members of a given subset s(i)' are less similar to s0 than
members of the lower ordered subset s(i-2)'.

By extension, the sets of molecules s(i), where i is odd, comprise a gradient in shape
space surrounding s1 where members of set s(i+2)' are less similar than members of lower
ordered subsets s(i)' to the s1 shape complements of s0. It follows that molecules that bind
members of odd numbered rings are successively more similar to s0, on average, as they
bind to odd numbered rings of lower order. That is, for example, molecules binding
members of s5' are generally more similar to s0 than molecules binding members of s7'.
Thus, the odd numbered sets provide a complementary shape "gradient" to select or screen
for molecules ever more similar to s0.

In selecting members of a given subset s(i)', odd or even, it may be advantageous to
proceed by competitive displacement of members of s(i) off s(i-1)' using only members of
s(i-2)'. Alternatively, members of set s(i)' may be selected by competition with members of
any lower ordered subset s(i-k)' where i - k ≥ 0 and k is even. Choosing k > 2 results in a
less-steep gradient because successive subsets are chosen by competition with molecules
that are somewhat more similar to either s1 or s0. The value of k need not be the same for
obtaining successive subsets. Thus, the choice of k allows the gradient in shape function
space between successive sets of molecules to be tuned at each step in the process.
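
The constraint on k can be made concrete with a small helper. This is a sketch for illustration: the function name is hypothetical, and it only enumerates which lower ordered subsets satisfy the stated conditions (k even and i - k ≥ 0), taking k = 2 as the smallest step.

```python
def allowed_competitor_orders(i):
    """Orders i - k of the subsets usable as competitors when selecting ring i.

    k runs over the even values 2, 4, ... for which i - k >= 0, so the
    competitors always have the same parity as ring i.
    """
    return [i - k for k in range(2, i + 1, 2)]


print(allowed_competitor_orders(6))  # even ring: competitors are even rings
print(allowed_competitor_orders(5))  # odd ring: competitors are odd rings
```

For an even ring the list always ends at order 0, the targetshape itself, and for an odd ring it ends at order 1, which matches the statement that a larger k selects against molecules closer to s1 or s0.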

The steepness of the gradient surrounding s0 or s1 may also be modified by setting
more or less stringent competition or binding requirements for entry into a given set or
subset.

5.4.1 Multiple Target Problems 

The ability to detect lead candidates that do not bind a target efficiently enough to
be identified by available screening techniques allows the method described above to be
extended to address the multiple target problem. Each characteristic that the molecule must
possess or criterion that the desired molecule must satisfy can be considered a "task."
Without loss of generality, and as an example only, consider the problem of finding a
molecule able to accomplish two tasks such as having the ability to bind receptors of both
estrogen and progesterone. Defining estrogen as s0, one may obtain a targetshape group of
n sets of molecules corresponding to targetshape rings around estrogen. Similarly, define
progesterone as r0 and obtain sets of molecules corresponding to targetshape rings around
r0. The targetshape groups surrounding each targetshape need not contain the same number
of sets of molecules.

A molecular diversity library is then screened for molecules that bind at least one
member of an odd ring of both targetshape s and targetshape r. Alternatively, or in addition,
members of at least one even numbered ring of targetshape s and/or targetshape r may be
used to raise antibodies against the even numbered ring. The antibodies are then screened
for molecules that bind at least one odd ring of both targetshape s and targetshape r.
Consider a molecule X that binds to s3' and also to r5'. By way of example, suppose that X
does not bind to s1 nor to r1. Thus, X would remain undetected using conventional
screening tests for molecules binding only the equivalents of targetshape sets s1 and r1.
Conversely, using the full targetshape groups surrounding estrogen and progesterone,
candidates that are only somewhat similar to both estrogen and progesterone may be found.
To obtain compounds with improved binding to both the estrogen and progesterone
receptors, variants of X are obtained or generated, for example, using a molecular diversity
approach or other random chemistry approach. Alternatively, or in addition, molecules in
the same region of shape space as X are generated or obtained. The new population of
molecules is screened to identify members that are more similar to both estrogen and
progesterone as ranked by the ability to bind members of lower ordered odd numbered rings
of targetshape s and/or targetshape r. Thus, a molecule binding at least one member of s3'
and r3' is an improvement of X, for it is more similar to progesterone and as similar to
estrogen. A molecule able to bind at least one member of s1 and r3' is better than X in
being more similar both to estrogen and to progesterone.
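
The ranking of variants against X can be expressed as a small comparison. The sketch below assumes each candidate is summarized by the lowest ordered odd ring it binds in each targetshape group, with a lower order meaning greater similarity to the targetshape; the function names and the dictionary layout are hypothetical.

```python
def lowest_odd_ring(bound_rings):
    """Lowest ordered odd ring among the ring orders a candidate binds, or None."""
    odd = [i for i in bound_rings if i % 2 == 1]
    return min(odd) if odd else None


def improves_on(candidate, reference):
    """True if candidate is no worse on every targetshape and better on at least one.

    Both arguments map a targetshape name to the lowest odd ring order bound.
    """
    no_worse = all(candidate[t] <= reference[t] for t in reference)
    better = any(candidate[t] < reference[t] for t in reference)
    return no_worse and better


x = {"s": lowest_odd_ring({3, 5, 7}), "r": lowest_odd_ring({5, 7})}  # binds s3' and r5'
variant_a = {"s": 3, "r": 3}  # as similar to estrogen, more similar to progesterone
variant_b = {"s": 1, "r": 3}  # more similar to both

print(improves_on(variant_a, x), improves_on(variant_b, x), improves_on(x, variant_a))
```

A comparison of this kind yields only a partial ordering: two variants that each improve on a different targetshape are not ranked against one another.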

The screening power of this extended targetshape approach is particularly
advantageous in situations where one cannot sample molecular shape space with sufficient
density to identify immediately candidates, if they exist, that bind both s1 and r1. For
example, although a molecular diversity library may contain approximately 10^15 distinct
molecular species, the chance of identifying a potential candidate to mimic both s0 and r0
may be slim. However, using higher order sets of a targetshape group instead of one or a
small number of target compounds, a broader region of molecular shape space can be
sampled with high density to identify proto-candidates even only somewhat similar to both
s0 and r0. Subsequently, those candidates can be used to generate a molecular diversity library of
shape variants focused around the same general region of shape space as the initial
candidates. Thus, about 10^15 molecules in that "region" of shape space can be created and
screened or selected upon for improved variants. The improved variants can then be used to
create still further variants, in an attempt to increase similarity to both s0 and r0.

The process described above can be generalized to seek molecules able to fulfill a
plurality of arbitrary "Boolean" or logical combinations of "yes" and "no" conditions on
different targets or different criteria. For example, one may seek a molecule that binds one
hormone receptor but does not bind the receptor of another hormone, or one may seek a
molecule that binds a cis acting promoter of one gene but does not bind the cis acting
promoter of another gene, etc. In general, n targetshape groups, r, s, t, ..., would be
constructed, one for each of n tasks. Candidates only partially fulfilling one or more of the tasks
may be identified. Improved candidates could be sought by obtaining variants better able to
fulfill each task separately, or by obtaining variants better able to fulfill any subset t of the n
tasks. Then, the initial candidates may be optimized to seek the practically accessible Pareto
optimal set.

For example, consider a multitask problem that includes obtaining molecules that
bind at least one target, and do not bind at least one other target. As a specific example
only, consider seeking a molecule that binds the estrogen receptor but does not bind the
progesterone receptor. Targetshape groups are constructed around estrogen, s0, and
progesterone, r0. Initial candidates are sought that bind to low ordered odd rings of
targetshape s and either will not bind any odd ring of targetshape r, or will only bind higher
ordered odd rings of targetshape r so that they are unlikely to interact with or bind the
progesterone receptor. However, an initial candidate may bind at least one lower ordered
odd ring of targetshape r. By successively generating variants of the initial candidates and
selecting improved candidates that bind primarily to higher ordered odd rings of targetshape
r, variants of initial candidates are "tuned" away from similarity to progesterone or
interaction with the progesterone receptor. The ability of a variant to bind lower ordered 
odd rings of estrogen may also be considered when selecting improved variants to continue 
the optimization procedure. Thus, targetshape provides a means to "sculpt" molecules to
enhance the ability to bind the receptor of one molecule while decreasing binding to the
receptor of a second molecule. In this example, even subtle side effects due to undesirable 
binding of a drug to the progesterone receptor would be avoided. 

The initial candidates X can be partially ordered based upon their binding to rings of
one or more targetshape groups. It may be advantageous to order the candidates fully
according to a system that assigns a different weight to the ability to accomplish each task.
The candidates can also be ordered based on the absolute or relative number of members of
a given targetshape ring that bind each candidate and/or the relative strength or efficiency of
binding. However, it is not necessary, even to optimize the ability to accomplish multiple
tasks, to create a full ordering relationship between all candidate molecules.

A pareto optimal set of candidates, X, is defined as a set of molecules having the
property that no other molecules exist that are better than the members of the pareto optimal
set with respect to at least one "task" and at least as good with respect to the remaining
tasks. In the case where only a partial ordering exists, the pareto optimal set constitutes the
"end point" of the effort to find good candidates for both tasks. The pareto optimal set is
defined for the set of all possible molecules. Thus, in reality, it is impossible to assure that
a candidate pareto optimal set, X, is actually pareto optimal.

In the current context, a pareto optimal set of candidates for the tasks of being
similar to estrogen and also to progesterone is a set X such that no other molecule exists that
is more similar to estrogen, and at least as similar to progesterone; or that is more similar to
progesterone and at least as similar to estrogen.
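The dominance test in this definition can be illustrated with a short sketch. It is only an illustration: the candidate names and similarity scores are hypothetical, higher scores are assumed to be better, and the result is pareto optimal only relative to the candidates actually listed, not to the set of all possible molecules.

```python
from typing import Dict, List, Tuple

Scores = Tuple[float, ...]  # one score per task; higher is assumed better


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every task and better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_set(candidates: Dict[str, Scores]) -> List[str]:
    """Names of candidates that no other listed candidate dominates."""
    return [
        name
        for name, s in candidates.items()
        if not any(dominates(t, s) for other, t in candidates.items() if other != name)
    ]


# Hypothetical scores: (similarity to estrogen, similarity to progesterone)
scores = {
    "A": (0.9, 0.2),
    "B": (0.6, 0.6),
    "C": (0.5, 0.5),  # dominated by B on both tasks
    "D": (0.2, 0.9),
}
front = pareto_set(scores)
```

Because only a partial order is needed, no weighting of the two tasks enters the computation.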

5.5 Rational Drug Design and Target Shape Groups

Drug design has historically involved "discovering" a particular chemical substance
that interacts in some way with receptors, e.g., proteins in the living cells of an organism.
As proteins are made up of polypeptides, it is not surprising that some effective drugs are
also peptides, or are patterned after peptides. Thus, without limitation, the activity of a drug
may be couched within the framework of polypeptides. The description, however, is
applicable to any agent capable of modifying the activity of a biological molecule having a
receptor. Generally, for two peptides to effectively interact with each other, e.g., one as a
protein receptor and the other as a drug, it is necessary that the complex three-dimensional
shape ("conformation" or "pose") of one peptide assume a compatible conformation that
allows the two peptides to fit and bind together in a way that produces a desired result. In
such instance, the complex shape or conformation of a first peptide has been compared to a
"lock", and the corresponding requisite shape or conformation of the second peptide to a "key" that
unlocks (i.e., produces the desired result within) the first peptide. This "lock-and-key"
analogy emphasizes that only a properly conformed key (second peptide or compound
patterned thereafter) is able to bind or fit within the lock (first peptide) in order to "unlock"
it (produce a desired result). Further, even if the key fits in the lock, it must have the proper
composition in order for it to perform its function. That is, the second peptide must contain
the right elements in the right spatial arrangement and position in order to properly bind
with the first peptide, e.g., receptor protein. The random diversity or combinatorial
approach to drug discovery, as described above, does not require direct knowledge of the
conformation of the target compounds or potential leads that bind the target compounds or
are members of a target shape group as described above. As part of the present invention,
however, discovering or predicting the proper conformation or shape of the key, or second
peptide or compound patterned thereafter, can assist drug discovery, as described below.

Most polypeptide structures exhibit several conformations that are stable, some
more so than others. The most stable conformations are the most probable. A conformation
may change from one stable conformation to another through the application of sufficient
energy to cause the change. Given the opportunity to freely move, fold and/or bend, a given
polypeptide chain will eventually assume a stable conformation. The most probable
conformation that is assumed is the one that would take the most energy to undo. This most
probable conformation is referred to herein as the "global minimum". Other stable
conformations are less probable, but may readily be assumed, and are referred to herein as a
"local minimum" or "local minima". A conformation that represents a local minimum could
thus be changed, through application of an external force, to another stable conformation
which is either another, different local minimum or the global minimum.

Only by designing the conformation of one peptide to allow it to fit within the
conformation of the other peptide and bind thereto will the desired interaction between the
two peptides take place. Thus, principal generic steps in the rational drug design process
include: (1) identification and determination of the structure of a receptor site or molecule
binding the receptor site; (2) use of theoretical principles and experimental data to propose a
series of putative compounds that will bind to the receptor site, which compounds may be
synthesized and tested for complementarity with the active site; (3) determination of the
structure of receptor/ligand complexes that bind with high efficiency, i.e., with low free
energies; and (4) iteration of steps 1-3 to further enhance binding.

Rational drug design thus includes knowing or predicting the conformation of a
desired protein receptor peptide. Random chemistry, such as the methods of the present
invention described above, provides a starting point to rational drug design approaches by
identifying molecules that have suitable activities, such as binding efficiencies, toward a
target compound. Indeed, the target shape groups of the present invention allow the
identification of potential candidates that would not be identified using known techniques.
Moreover, by providing a series of diversity libraries, the target shape rings, the present
invention provides a database of molecular shapes, which form a gradient in conformation
space around a particular target. Thus, the present invention is ideally suited to combine a
random chemistry approach with a rational design approach in identifying and designing
putative drug leads.

The even and odd numbered rings contain information about the requisite sequences
and conformations (shapes) in molecular shape space that are capable of mimicking a
molecule such as, for example, estrogen, or are capable of mimicking its complement, for
example, the estrogen receptor. The shape information can be obtained by predicting the
shape directly from the sequence information, or experimentally by NMR, x-ray crystallography,
neutron scattering, high-resolution mass spectrometry, crystallization, or other experimental
techniques to examine either members of S1, S3, S5, or S2, S4, S6 alone, or as bound pairs,
say a member of S1 and a member of S2 that bind one another. Higher order interactions,
such as three or more bound molecular species, can be examined to ascertain whether a
combination of more than one compound can modulate the activity of a receptor such as, for
example, by modulating the binding of C to a receptor. In this fashion, for example, a
protein phosphorylase can be made to modulate the activation of a transcription factor.

A molecule such as estrogen, which binds a receptor, such as the estrogen receptor,
also contains information about the shape needed to bind its receptor. The consensus
sequence noted above, derived from the members of even numbered rings of at least one
target shape group, contains more information about the sequences and shapes
able to bind the estrogen receptor than does estrogen itself. But the members of S2 contain
less information about the requisite sequences in shape space to mimic estrogen than does
S0, i.e., the molecule used to form the target shape group. Additionally, the structures of
members of S2, S4, S6, ... provide successively less information about the structure and
conformation needed to bind or interact with the receptor. Thus, these molecules form a
gradient in shape space, which may be used to guide the selection, search or design of
additional candidate molecules using data mining techniques as described below.

Similarly, the estrogen receptor, a member of S1, contains some information about
sequence and shape required to bind estrogen, but less information than the entire set S1,
and still less than the members of the odd sets, S1, S3, S5, ... Thus, structural activity
information from both the even and odd groups can be used in rational drug search,
selection, and design.

Neural networks, factor analysis, principal component analysis, independent
component analysis, singular value decomposition and other mathematical data mining
techniques can be trained on a training set to extract relevant sequence or shape information
about molecular sequences, structures and shapes needed, for example, to mimic estrogen,
or mimic the estrogen receptor. In general, several factors increase the informing power or
predictive power of such data mining techniques. First, the averaging effect of larger data
sets acts to minimize the effect of spurious shapes, e.g., the shapes of molecules which bind
a target away from the active site but are erroneously included in the model. Second, as
understood in the art, the informing power is increased when the model or data mining
technique includes orthogonal information along several different vectors. By including, for
example, molecules from more than one target shape ring, the present method
exploits the fact that members of, for example, successively
higher ordered even targetshape rings are increasingly different in both structure and
binding ability relative to the compound S0. These structural differences help guide the
model or data mining technique toward those features which are most important in
determining an activity or binding efficiency.
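As one illustration of the data mining techniques named above, the sketch below extracts the first principal component of a small feature matrix by power iteration on the covariance matrix. It is a minimal sketch under stated assumptions: the rows stand for hypothetical molecules, the columns for hypothetical shape descriptors, and power iteration is chosen only to avoid external libraries. The starting vector must not be orthogonal to the leading component.

```python
import math
from typing import List


def first_principal_component(rows: List[List[float]], iters: int = 200) -> List[float]:
    """Unit vector along the direction of greatest variance in the rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    # Sample covariance matrix of the descriptor columns
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d  # assumed starting vector
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance in the data
        v = [x / norm for x in w]
    return v


# Hypothetical descriptors for four molecules; the second column is twice the first
rows = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
pc = first_principal_component(rows)
```

For these rows all of the variance lies along the direction (1, 2), so the returned vector is proportional to it.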

Molecules from both even and odd targetshape groups can be included in the data
mining approach. This provides further orthogonal information with which to guide the
model. Known applications of data mining techniques to rational drug design have not made
use of such a gradient of molecules or members of complementary sets of molecules such as
members of the odd targetshape groups.

Moreover, as discussed above, the targetshape approach to creating
molecular diversity can be applied to a multiple target problem by creating more than one
targetshape group centered around more than one molecule, such as, for example, the
estrogen and progesterone receptors. Since the members of rings of each targetshape group
can be tested for properties such as renal clearance or liver toxicity, and because members of
even and odd rings of each targetshape group are families of related shapes, these
experiments contain information about the requisite molecule structure and shape features
needed to obtain pareto optimal drug leads with respect to a set of one or more design criteria.
Thus, as a non-limiting example, the present invention can be used with the sparse
membership of the rings of each targetshape group together with data mining techniques
such as neural networks to predict structures and shapes that constitute the global pareto
optimal surface with respect to two or more design criteria.

Thus, it will be clear that the use of members of the even or odd rings, respectively,
can enhance the capacity of neural networks or other data mining procedures to extract the
sequence, structure and shape features required for good binding to or modulation of a
molecule or receptor such as, for example, estrogen, or to the estrogen receptor. Clearly,
estrogen and its receptor are a non-limiting example. We could equally consider DNA
binding molecules, cis site specific DNA binding molecules, molecules binding cis-trans
complexes, where those molecules are DNA, RNA, proteins, small molecules or other
classes of molecular species such as carbohydrates, lipids, or other polymers, or small
molecule families such as those now familiar in combinatorial chemistry.

More generally, the above invention can be used to discover experimentally, or
predict rationally, combinations of molecular species, say A and B, which together bind and
modulate the activity of, for example, a phosphorylase enzyme acting at some point in a
cellular signaling cascade, or a transcription complex binding a DNA or RNA cis site
regulating transcription, or otherwise modulating any desired biological activity within or
between cells, or organisms. A process for discovering A and B may comprise, for example,
co-crystallization of the species A and B with, for example, an enzyme whose activity they
modulate.

5.5.1 Molecular Structure Determination

The molecular structure of a molecule selected from a screening step can be
characterized to predict geometric and conformational features, which lead toward a
suitable lead compound. Molecular structure information comprises, but is not limited to,
for example, the absolute and relative positions of nuclei, the relative electron density
distribution, bond angles, bond lengths, van der Waals radii of atoms in the molecule,
chirality, and charge. Additionally, such information can be acquired over all or only a
portion of a molecule.

Experimental structure determination can be accomplished, for example, using
X-ray crystallography and solution-state nuclear magnetic resonance (NMR) (MacArthur et
al., 1994, Trends in Biotechnology 12:149-153).

X-ray crystallography depends on the interaction of electron clouds with X-rays to
provide information on the location of every heavy atom in a crystal of interest. The
accuracy of X-ray crystallography is 0.5-2.0 Å (1 Å = 10^-8 cm). Cocrystallization allows the
structure of more than one bound molecule to be determined, providing further information
about the conformation of the active site and complex. Such information can be used, for
example, to design, select, or optimize molecules that act in tandem to modify the activity
of a biological molecule.

Approaches for structure determination using NMR include, for example, the methods
described in Biomolecular NMR Spectroscopy by Jeremy N.S. Evans, Oxford University
Press, New York, 1995 and Nuclear Magnetic Resonance of Proteins and Nucleic Acids by
Kurt Wüthrich, John Wiley and Sons, 1986, which are incorporated in their entireties by
reference. NMR relies upon correlations between nuclear spins resulting from
dipole-dipole interactions indirectly mediated by the electron clouds. High-resolution,
multidimensional, solution-state NMR techniques are an attractive alternative to
crystallography since they can be applied in situ (i.e., in aqueous environment) to the
study of small protein domains (Yu et al., 1994, Cell 76:933-945). Solution-state NMR has
been successful at determining the structure of moderate-sized proteins and protein/ligand
complexes to a 2 Å resolution, similar to that of X-ray methods (MacArthur et al., 1994,
Trends in Biotechnology 12:149-153; Clore et al., 1994, Protein Science 3:372-390). The
structure of a ligand (such as a pharmaceutical lead compound) bound to a protein is most
efficiently determined when the protein is uniformly 13C and 15N labeled, and the binding
occurs in the slow exchange limit (Clore et al., 1994, Protein Science 3:372-390). In this
limit, a bound complex remains together long enough for resonances of the free and bound
form of the ligand to be resolved.

One method that avoids some limitations associated with X-ray crystallography and
solution-state NMR, and has significant advantages, is solid-state NMR, particularly
dipolar-dephasing experiments such as rotational echo double resonance (REDOR) (Gullion
et al., 1989, Journal of Magnetic Resonance 81:196-200; Gullion et al., 1989, Advances in
Magnetic Resonance 13:57-83). Compared with crystallography, solid-state NMR has the
advantage that it obtains high-resolution structural information from polycrystalline
and disordered materials. This eliminates the need for the formation of highly regular
crystals to achieve high resolution diffraction, and eliminates structural perturbations due to
crystal packing forces. In contrast to solution-state NMR, which relies upon mutual
correlations between nuclei from the indirect dipolar coupling (studied via the Nuclear
Overhauser effect, NOE) that fall off as 1/r^6, solid-state NMR relies upon the direct dipolar
coupling, which decreases as 1/r^3, for the measurement of internuclear distances, where r is
the internuclear distance. As a result, longer distances can be measured with solid-state
NMR, and the distances measured have a higher degree of accuracy and precision.
Furthermore, solid-state NMR is not strictly limited by the size of the complex resulting
from the drug bound to a target molecule. In solid-state NMR experiments, the size
limitations are determined primarily by the quantity of the sample available and the
sensitivity of the NMR spectrometer.

One advantage of the REDOR transform technique over solution-state NMR
measurement is the direct and accurate determination of the internuclear distance from a
measured frequency. Solution-state NMR experiments rely upon the indirect measurement
of the dipolar coupling for distance measurements. In solution-state NMR there is no direct
relationship between an experimentally measured parameter and the distance. Instead, the
strength of the coupling, as inferred from the Nuclear Overhauser effect (NOE), is related to
a range of possible distances spanning a few Angstroms.
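The direct relationship between a measured dipolar frequency and the internuclear distance can be sketched numerically. The sketch assumes the standard expression for the heteronuclear dipolar coupling constant in hertz, D = (μ0/4π)·|γ1·γ2|·ħ/(2π·r^3), and uses textbook gyromagnetic ratios for 13C and 15N; the function names are illustrative and are not taken from any NMR software package.

```python
import math

MU0_OVER_4PI = 1.0e-7        # T*m/A
HBAR = 1.054571817e-34       # J*s
GAMMA_13C = 6.728284e7       # rad s^-1 T^-1
GAMMA_15N = -2.71261804e7    # rad s^-1 T^-1


def dipolar_coupling_hz(r_angstrom: float,
                        gamma_1: float = GAMMA_13C,
                        gamma_2: float = GAMMA_15N) -> float:
    """Dipolar coupling constant (Hz) for two spins r_angstrom apart."""
    r_m = r_angstrom * 1.0e-10
    return MU0_OVER_4PI * abs(gamma_1 * gamma_2) * HBAR / (2.0 * math.pi * r_m ** 3)


def distance_angstrom(coupling_hz: float,
                      gamma_1: float = GAMMA_13C,
                      gamma_2: float = GAMMA_15N) -> float:
    """Invert the 1/r^3 law: distance (Angstrom) from a measured coupling (Hz)."""
    at_one_angstrom = dipolar_coupling_hz(1.0, gamma_1, gamma_2)
    return (at_one_angstrom / coupling_hz) ** (1.0 / 3.0)


# Doubling the distance reduces the coupling eightfold
ratio = dipolar_coupling_hz(2.0) / dipolar_coupling_hz(4.0)
recovered = distance_angstrom(dipolar_coupling_hz(2.5))
```

Because the coupling scales as 1/r^3, a single measured frequency maps to a single distance, in contrast to the range of distances inferred from an NOE intensity.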

In addition to REDOR, other dipolar-dephasing (dipolar-recoupling) methods such
as TEDOR (Hing et al., 1993, Journal of Magnetic Resonance, Series A 103:151-162; Hing
et al., 1992, Journal of Magnetic Resonance 96:205-209), DRAMA (Tycko et al., 1990,
Chemical Physics Letters 173:461-465; Tycko et al., 1993, Journal of Chemical Physics
98:932-943), DRAWS (Gregory et al., In 36th Experimental Nuclear Magnetic Resonance
Conference; Boston, Mass., 1995; p 289), and MELODRAMA (Sun et al., 1995, Journal of
Chemical Physics 102:702-707), are known in the art.

As applied to biological materials, REDOR has primarily been used to determine the
distance between one 13C atom and one 15N atom (Marshall et al., 1990, Journal of the
American Chemical Society 112:963-966; Garbow et al., 1993, Journal of the American
Chemical Society 115:238-244). Because of the nature of nuclear magnetic interactions in
the solid state, REDOR has the inherent ability to measure internuclear distances with a
high degree of accuracy and precision. REDOR measurements are accurate to better than
0.05 Å when the 13C-15N distances are from 0 to 4 Å, and to better than 0.1 Å when the
13C-15N distances are from 4 to 6 Å. Garbow and Gullion (Garbow et al., 1991, Journal of
Magnetic Resonance 95:442-445) have shown that the data acquisition rate using REDOR can
be enhanced by the measurement of REDOR signals from chemically shifted nuclei.

5.5.2 Rational Design of Molecules and Prediction of Activity

The success of rational drug design approaches can be enhanced through the use of
computational methods to identify key molecular portions or fragments, which are relevant to
the binding or activity of the putative drug leads. In order to predict the activity of
molecules not yet synthesized or for which not much is known with respect to a particular
chemical function, such as binding to a particular receptor, one would first start with
molecular structures and assay values of known molecules with known activities with
respect to such chemical function. Examples of such methods include Rogers D. and
Hopfinger A. J., J. Chem. Inf. Comp. Sci., 1994, 34, 854-866, which is incorporated in its
entirety by reference. The molecular structure information determined using the methods
described above and assay information may provide input information for a neural network.
As understood by one of ordinary skill in the art, neural networks may have a variety of
structures and the example described below illustrates but one approach to predicting the
activity of a new molecule or to predicting pertinent conformational features of the active
site.

The molecules used in the training set, for example, may comprise members of
target shape groups obtained by the process for obtaining a molecular diversity library, as
described above. Training data, e.g., the molecular conformations and/or activities of the
training set molecules, are subsequently used in a learning model which is refined to
generate consistent hypotheses to explain the training data. However, in order to make the
learning process more efficient, a bootstrap procedure may be employed. This procedure
includes finding stable conformers from the structure data, posing the conformers and
selecting initial poses from the poses to form an initial training set. After the training set is
formed, the set is used in a learning step to refine a system which is then used to predict the
activity of a molecule not in the training set.

To increase the predictive power of the method, the training data preferably includes
data on a plurality of molecules. As known to those skilled in the art, biologically active
molecules can take on different shapes known as conformers or conformations defined by
the internal torsion angles of the rotatable bonds in the molecule. An active molecule, for
example, may be any molecule that binds a member of a lower or higher ordered target
shape group, as described above. In order to increase the computational efficiency in
learning, however, it is desirable to choose only the conformations that are best in
confirming or refuting the learning model. Thus, in addition to molecules that bind members
of a lower or higher ordered target shape group, molecules which have weak binding
activities may be included in the training set. The model may be designed to account for the
relative binding efficiencies of the included molecules by appropriately weighting the
importance of each structure within the model.

The first step in this selection involves posing the molecule. A pose of a molecule is
defined by its conformation (internal torsion angles of the rotatable bonds) and orientation
(three rigid rotations and translations). This mathematically defines the pose or geometrical
conformation of the molecule. First, a conformer of an active molecule is chosen and its
pose is fixed. The initial pose or conformation may be determined from empirical
measurements or ab initio calculations. In chemical terms, this is analogous to permitting
the molecule to rotate, translate and alter its conformation to achieve its best possible fit to
the binding site. The rotation, translation and alteration in the internal torsion angles of the
rotatable bonds in a molecule are referred to herein as reposing of the molecule. In other
words, since the fixed pose of a molecule known to have high activity is used as the
reference for reposing the remaining molecules, this crudely simulates the process of
reposing the other molecules to achieve the best possible fit to the binding site. The
above-described process can be performed using a number of software packages available
commercially, such as, for example, Catalyst from BioCAD, Foster City, Calif., and
Batchmin available from Columbia University, New York City, N.Y.
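The rigid-body part of reposing (rotation and translation) can be sketched in a few lines. This is only an illustrative sketch: it applies one rotation about an arbitrary axis through the origin followed by one translation, leaves the internal torsion angles untouched, and is not the procedure implemented by the packages named above. The function names and coordinates are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def rotate(p: Vec, axis: Vec, angle: float) -> Vec:
    """Rotate point p by angle (radians) about an axis through the origin."""
    norm = math.sqrt(sum(a * a for a in axis))
    kx, ky, kz = (a / norm for a in axis)
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    dot = kx * x + ky * y + kz * z
    cross = (ky * z - kz * y, kz * x - kx * z, kx * y - ky * x)
    # Rodrigues' rotation formula
    return (
        x * c + cross[0] * s + kx * dot * (1.0 - c),
        y * c + cross[1] * s + ky * dot * (1.0 - c),
        z * c + cross[2] * s + kz * dot * (1.0 - c),
    )


def repose(coords: Sequence[Vec], axis: Vec, angle: float, shift: Vec) -> List[Vec]:
    """Apply the same rotation and translation to every atom position."""
    moved = []
    for p in coords:
        r = rotate(p, axis, angle)
        moved.append((r[0] + shift[0], r[1] + shift[1], r[2] + shift[2]))
    return moved


# Hypothetical single atom: quarter turn about z, then a shift of 1 along z
new_coords = repose([(1.0, 0.0, 0.0)], (0.0, 0.0, 1.0), math.pi / 2.0, (0.0, 0.0, 1.0))
```

A full repose would also vary the torsion angles of the rotatable bonds, which requires the bond connectivity of the molecule and is omitted here.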

The following presents a non-limiting example of neural networks applied
to conformational data of molecules. As understood by one of ordinary skill in the art, such
models could also include empirical activity data such as, for example, a relative binding
efficiency. Alternatively, or in combination, the model may also include a combination of
empirical conformational data and theoretical conformational data, such as that derived
from ab initio molecule structure calculations. The learning process now begins with a
selection of only some of the poses to be in the training set. In other words, if a molecule
has more than one conformation, poorer matches may be dropped for computational
efficiency in the subsequent learning process. In making the selection, various properties of
the molecules in the data set known to chemists may be used, including physical and
chemical properties such as shape, electrostatic interaction, solvation and biophysical
properties, such as, for example, a binding efficiency to a predetermined target.

Before the selected poses may be used for training, the relevant features of these
poses are first extracted. The CoMFA methodology described in U.S. Pat. No. 5,025,388,
for example, employs a three-dimensional lattice structure and extracts the relevant features
by calculating the steric and electrostatic interaction energies between a probe atom placed
at each of the lattice intersections and the molecule.
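A lattice-based extraction of this kind can be sketched as follows. The sketch is an assumption-laden illustration and not the method of the cited patent: the steric term is modeled with a Lennard-Jones 6-12 potential, the electrostatic term with Coulomb's law, and the probe charge, probe radius, well depth and energy cap are hypothetical values chosen for the example.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple

COULOMB_K = 332.06  # kcal*Angstrom/(mol*e^2), a commonly used conversion constant

Atom = Tuple[float, float, float, float, float]  # x, y, z, partial charge, vdW radius


def lattice(lo: float, hi: float, spacing: float) -> List[Tuple[float, float, float]]:
    """Regular cubic lattice of probe positions between lo and hi on each axis."""
    steps = int(round((hi - lo) / spacing)) + 1
    axis = [lo + k * spacing for k in range(steps)]
    return list(product(axis, axis, axis))


def probe_fields(point: Sequence[float], atoms: Sequence[Atom],
                 probe_charge: float = 1.0, probe_radius: float = 1.7,
                 epsilon: float = 0.1, cap: float = 30.0) -> Tuple[float, float]:
    """Steric and electrostatic energies of a probe atom at point (assumed model)."""
    steric = 0.0
    electrostatic = 0.0
    for x, y, z, charge, radius in atoms:
        r = max(math.dist(point, (x, y, z)), 1e-6)  # avoid division by zero
        ratio6 = ((radius + probe_radius) / r) ** 6
        steric += epsilon * (ratio6 * ratio6 - 2.0 * ratio6)
        electrostatic += COULOMB_K * probe_charge * charge / r
    # Cap the energies so lattice points inside atoms do not dominate
    return min(steric, cap), max(-cap, min(electrostatic, cap))


# Hypothetical one-atom "molecule" and a 3 x 3 x 3 lattice around it
atoms = [(0.0, 0.0, 0.0, 0.2, 1.7)]
grid = lattice(-2.0, 2.0, 2.0)
steric, electrostatic = probe_fields((3.4, 0.0, 0.0), atoms)
```

Evaluating `probe_fields` at every point of `grid` yields one steric and one electrostatic value per lattice intersection, which together form the feature vector for the pose.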

Another approach involves creating a surface representation of each of the poses and
then obtaining a feature value between at least one sampling point and a point on the surface
representation of each of the poses. For example, the van der Waals surface of at least
several atoms in a predetermined portion of the molecule, such as, for example, the active
site, may be found. A curved surface having a number of ridges at the intersections of adjacent
van der Waals surfaces results and is a surface representation of the portion of the molecule.
As known to those skilled in the art, the electron density around each atom can be
represented as a Gaussian function of distance from the nucleus of the atom where the peak
of such Gaussians would more or less coincide with the van der Waals radius of the
atom. A surface representation of the portion of the molecule can then be obtained by
summing the Gaussian functions for the atoms. The surface representation arrived at using
the van der Waals surfaces of the atoms has been found to be adequate and easy to find for
most purposes for modeling biological and chemical activity, whereas the sum of
Gaussians approach gives a scientifically more rigorous representation of such a surface. The
details of finding the van der Waals surfaces of atoms and calculations involving such a
surface are known to those skilled in the art and will not be explained in detail here.
Similarly, the Gaussian distributions for the atoms and the method for summing them are also
known to those skilled in the art and will not be explained in detail here. Other than van der
Waals and Gaussian surface representations, other types of surface representation are
possible, such as a Connolly surface. See M. J. Connolly, J. Appl. Cryst., 16, 548 (1983).
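The sum-of-Gaussians representation can be illustrated with a minimal sketch. It assumes, following the description above, one Gaussian per atom as a function of distance from the nucleus with its peak at the van der Waals radius; the width parameter and the atom list are hypothetical, and no particular normalization is implied.

```python
import math
from typing import Sequence, Tuple

AtomShell = Tuple[Tuple[float, float, float], float]  # (center, van der Waals radius)


def gaussian_shell(distance: float, radius: float, width: float = 0.3) -> float:
    """Gaussian in the distance from the nucleus, peaked at the vdW radius."""
    return math.exp(-((distance - radius) ** 2) / (2.0 * width ** 2))


def summed_density(point: Sequence[float], atoms: Sequence[AtomShell],
                   width: float = 0.3) -> float:
    """Sum of the per-atom Gaussians evaluated at point."""
    return sum(
        gaussian_shell(math.dist(point, center), radius, width)
        for center, radius in atoms
    )


# Hypothetical two-atom fragment with well separated atoms
atoms = [((0.0, 0.0, 0.0), 1.7), ((10.0, 0.0, 0.0), 1.5)]
on_shell = summed_density((1.7, 0.0, 0.0), atoms)
off_shell = summed_density((3.0, 0.0, 0.0), atoms)
```

The summed function is largest near the van der Waals shells of the atoms and decays smoothly away from them, which avoids the sharp ridges of the intersecting-spheres surface.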


5.5.2a Feature Extraction 

The feature values, including steric, electrostatic or other feature values, may be
extracted by first specifying at least one sampling point and then obtaining a feature value
between such sampling point and a point on the surface representation of each of the
poses. The point may be outside but near the molecular surface and the feature value is
extracted by determining, for example, the minimum distance between such sampling point
and the surface representation of the pose. For simplicity, a surface representation of a pose
determined in the manner above will be referred to simply as the surface of the pose.
An electrostatic feature value may be extracted as the electrostatic interaction between a
probe atom placed at such sampling point and the pose. Alternatively, the electrostatic
feature value may be the sum of the Coulomb force interactions between the probe atom and
atoms of the pose surface. The above described approach will be referred to herein as the
point-based feature extraction approach. Preferably, a number of sampling points are chosen
surrounding the poses. In other words, the same sampling points are used to extract features
from each of the poses in the training set. To arrive at a common set of sampling points,
one may select the points by reference to the averaged position of the poses in the training
set.
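The point-based steric feature can be sketched for a van der Waals surface modeled as a union of spheres. For a sampling point outside every sphere, the minimum distance to the surface is the smallest value of (distance to atom center minus atom radius). The atoms and sampling points below are hypothetical, and the function names are illustrative only.

```python
import math
from typing import List, Sequence, Tuple

AtomSphere = Tuple[Tuple[float, float, float], float]  # (center, van der Waals radius)


def surface_distance(point: Sequence[float], atoms: Sequence[AtomSphere]) -> float:
    """Minimum distance from an outside point to the union-of-spheres surface."""
    return min(math.dist(point, center) - radius for center, radius in atoms)


def feature_vector(sampling_points: Sequence[Sequence[float]],
                   atoms: Sequence[AtomSphere]) -> List[float]:
    """One steric feature per sampling point, in a fixed order."""
    return [surface_distance(p, atoms) for p in sampling_points]


# The same sampling points are used for every pose in the training set
points = [(4.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
pose_a = [((0.0, 0.0, 0.0), 1.5), ((2.0, 0.0, 0.0), 1.2)]
pose_b = [((0.0, 0.0, 0.0), 1.5), ((0.0, 2.0, 0.0), 1.2)]
features_a = feature_vector(points, pose_a)
features_b = feature_vector(points, pose_b)
```

Because both poses are measured from the same sampling points, corresponding entries of `features_a` and `features_b` are directly comparable as inputs to a model.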



5.5.2b Form of the Model

Once features have been extracted for each initial pose in the initial training set,
these features are input to a parameterized mathematical model, such as, for example, a
neural network or principal component regression, to produce an activity prediction. For
example, let V(M, P) be the vector of n features extracted to represent molecule M in pose
P. Let the kth component of this vector be denoted V(M, P)_k.

During training, the optimal values for the model parameters are determined. It will
be understood that the scope of this invention includes a wide range of mathematical
models, including linear models and nonlinear models. In the preferred embodiment, the
model has the form:

Activity(V(M, P)) = Sigmoid( [remainder of the expression is illegible in the source] )

where

m is the number of weights
Sigmoid(x) = 1/(1 + exp(-x))
exp is the exponential function (whose base e is the base of the natural logarithm)
u_j is a real-valued weight [...]
μ_i is a real-valued location parameter
σ_i is a real-valued width parameter

The parameters of this model are:

u_j (j = 1...n)
v_{j,i} (j = 1...n, i = 1...m)
μ_i (i = 1...m)
σ_i (i = 1...m).

In this embodiment, the function G is a Gaussian-like function that will produce
large values when the measured feature V(M, P)_i is near to μ_i and smaller values when the
measured feature is distant from μ_i. The value of σ_i controls how rapidly the value of G
decreases as V(M, P)_i moves away from μ_i.
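One arrangement consistent with the parameter list above can be sketched as a two-layer network. This is an assumption, not necessarily the exact form of the preferred embodiment: each feature is passed through a Gaussian-like function G with its own location and width, the results are combined by the weights v_{j,i} into hidden units, and the hidden units are combined by the weights u_j into a single Sigmoid output. The specific form chosen for G below is likewise an assumption.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gaussian_like(x: float, mu: float, sigma: float) -> float:
    """Assumed form of G: equals 1 at x == mu and decays at a rate set by sigma."""
    return math.exp(-(((x - mu) / sigma) ** 2))


def activity(features: Sequence[float],
             u: Sequence[float],
             v: Sequence[Sequence[float]],
             mu: Sequence[float],
             sigma: Sequence[float]) -> float:
    """Assumed model: Sigmoid(sum_j u_j * Sigmoid(sum_i v[j][i] * G(feature_i)))."""
    g = [gaussian_like(x, m, s) for x, m, s in zip(features, mu, sigma)]
    hidden = [sigmoid(sum(vji * gi for vji, gi in zip(row, g))) for row in v]
    return sigmoid(sum(uj * hj for uj, hj in zip(u, hidden)))


# Hypothetical parameters: two features, one hidden unit
prediction = activity(
    features=[1.0, 2.0],
    u=[2.0],
    v=[[0.0, 0.0]],
    mu=[1.0, 2.0],
    sigma=[0.5, 0.5],
)
```

With these values both features sit exactly at their location parameters, so G returns 1 for each, the hidden unit evaluates to Sigmoid(0) = 0.5, and the prediction is Sigmoid(1).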
[...] starting values [...] initialized to be a small [...]

[...] same mathematical model applies equally well to electrostatic features. The values of μ_i
and σ_i for an electrostatic feature describe [...]
values for the feature (depending on the values of v_{j,i}). In fact, the same model is
applicable to other biological activity types including but not limited to
agonism, potency, receptor selectivity and tissue selectivity.

[...] set of poses is known to those skilled in the art.

As explained above, an initial set of poses is selected to form the training set in
order to train the model. Then the initial values for the parameters [...] are
chosen. The feature values of the poses in the training set are extracted as described above.
[...] the molecule. Then the parameter values set initially for feature [...] are modified to minimize
[...] is known that the presence of such sites would influence the orientation and conformations
[...] so that in actual fact the molecules would repose under such influence [...]
the parameter values in reference to poses in addition to or other than the best poses; all
such variations are within the scope of the invention.

If $p_j$ is the predicted activity of a particular pose j and $a_j$ its actual activity, then an 
error function for the training set of poses can be formed by, for example, the following 
equation: 

$$\text{Error Function} = \sum_{j=1}^{m} (p_j - a_j)^2$$ 

where m is the total number of poses (preferably only the best poses) in the set in reference 
to which the parameter values are to be modified. A wide variety of computational 
methods may be applied to minimize the error function with respect to the parameters of 
the model (e.g., $u_j$, $v_{ji}$, $\sigma_i$, n). Such methods are known to those skilled in the art 
and will not be described here. In the preferred embodiment, the gradient of the error 
function with respect to these parameters (except for n) is computed, and gradient descent 
methods are applied. Other methods such as conjugate gradient, Newton methods, simulated 
annealing, and genetic algorithms may also be used and are within the scope of the invention. 
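As a minimal sketch of the gradient descent approach, the example below minimizes the sum of squared differences between predicted and actual activities. The one-feature model, its bias term b, the finite-difference gradient, the step size, and the training data are all assumptions made for illustration and are not taken from the specification.

```python
import math


def predict(params, feature):
    """Hypothetical one-feature model: Sigmoid(u * G(feature) + b).
    The Gaussian-like G and the bias b are assumptions of this sketch."""
    u, b, mu, sigma = params
    g = math.exp(-((feature - mu) ** 2) / (sigma ** 2 + 1e-9))
    return 1.0 / (1.0 + math.exp(-(u * g + b)))


def error_function(params, features, activities):
    """Sum over poses of (predicted activity - actual activity)^2."""
    return sum((predict(params, x) - a) ** 2
               for x, a in zip(features, activities))


def numerical_gradient(params, features, activities, h=1e-6):
    """Central-difference estimate of the gradient of the error function."""
    grad = []
    for k in range(len(params)):
        up = list(params)
        down = list(params)
        up[k] += h
        down[k] -= h
        grad.append((error_function(up, features, activities)
                     - error_function(down, features, activities)) / (2 * h))
    return grad


def gradient_descent(params, features, activities, rate=0.05, steps=2000):
    """Repeatedly step the parameters against the gradient."""
    params = list(params)
    for _ in range(steps):
        grad = numerical_gradient(params, features, activities)
        params = [p - rate * g for p, g in zip(params, grad)]
    return params


# Made-up training data: one feature value and one measured activity per pose.
features = [0.0, 1.0, 2.0, 3.0]
activities = [0.1, 0.9, 0.9, 0.1]

initial = [1.0, 0.0, 1.0, 1.0]  # u, b, mu, sigma
fitted = gradient_descent(initial, features, activities)
initial_error = error_function(initial, features, activities)
final_error = error_function(fitted, features, activities)
```

An analytic gradient would normally replace the finite-difference estimate; the numerical version is used here only to keep the sketch short and independent of the exact model form.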

After the differences between predicted and actual activities of poses (e.g., best 
poses) have been minimized, such as by minimizing the above error function, such 
differences are compared to preset thresholds. If the differences are below the preset 
threshold or thresholds, one concludes that the process has converged and proceeds. If not, 
then one returns to calculate the predicted activities of poses in the training set by reference 
to the modified parameter values and again chooses, for each molecule, the best pose, i.e., 
the pose having the highest predicted activity. The parameter values are again modified to 
minimize differences between predicted and actual activities of best poses. This loop is 
repeated until the differences are found to be below the preset threshold or thresholds and 
the same best poses are chosen every time. 
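The loop just described, alternating between choosing best poses and refitting the parameters, might be sketched as follows. The stand-in model Sigmoid(w * feature + b), the single feature value per pose, the threshold, and the data are hypothetical simplifications introduced only to keep the example small.

```python
import math


def predict(params, feature):
    """Stand-in model (an assumption): Sigmoid(w * feature + b)."""
    w, b = params
    return 1.0 / (1.0 + math.exp(-(w * feature + b)))


def best_pose_indices(params, molecules):
    """For each molecule, pick the pose with the highest predicted activity."""
    return [max(range(len(poses)), key=lambda k: predict(params, poses[k]))
            for poses, _ in molecules]


def fit(params, features, activities, rate=0.5, steps=200):
    """Gradient descent on the sum of squared errors for the chosen poses."""
    w, b = params
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, a in zip(features, activities):
            p = predict((w, b), x)
            common = 2.0 * (p - a) * p * (1.0 - p)  # d(error)/d(w*x + b)
            grad_w += common * x
            grad_b += common
        w -= rate * grad_w
        b -= rate * grad_b
    return (w, b)


def train(molecules, params=(0.1, 0.0), threshold=0.05, max_rounds=50):
    """Alternate best-pose selection and parameter fitting until the error is
    below the threshold and the same best poses are chosen again."""
    chosen = best_pose_indices(params, molecules)
    for _ in range(max_rounds):
        features = [poses[k] for (poses, _), k in zip(molecules, chosen)]
        activities = [a for _, a in molecules]
        params = fit(params, features, activities)
        error = sum((predict(params, x) - a) ** 2
                    for x, a in zip(features, activities))
        new_chosen = best_pose_indices(params, molecules)
        if error < threshold and new_chosen == chosen:
            break
        chosen = new_chosen
    return params, chosen, error


# Made-up data: (feature value of each pose, measured activity) per molecule.
molecules = [([0.2, 1.5, 0.9], 0.9), ([-1.0, -0.3], 0.2), ([2.0, 1.0], 0.95)]
params, chosen, error = train(molecules)
```

The stopping test mirrors the two conditions in the text: the error must fall below the threshold, and the best-pose selection must stop changing between rounds.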

Then the molecules are reposed to maximize their activities and, from the possible 
poses after the reposing, poses are chosen to form a new training set. Instead of reposing 
the molecules, it is possible to simply re-select from the initial set of poses to form the 
training set of poses. However, it is believed to be preferable to repose the molecules in 
order to form a new training set. The new training set is compared to the prior training set 
to see whether the changes to the poses are below a certain set threshold or thresholds. If the 
changes are found to be below the threshold(s), then the process of training the model is 
completed and one proceeds to the prediction step. If the changes to the poses are not below 
the threshold or thresholds, then one returns to extract features from the training set as 
described above. 

Since the orientation and conformation of the poses may have changed, these new 
poses will have different feature values from those in the original training set. Therefore, 
the feature extraction step needs to be repeated. A molecule may be reposed by first 
re-orienting the molecule with respect to the sampling points. Then the internal torsion 
angles of the rotatable bonds are altered to re-conform the molecule to again best fit the 
surface portions of the molecule. 

The above-described process makes good use of the salient features of poses of 
inactive as well as active molecules. The above-described reposing process aligns and 
conforms poses of active molecules to maximize the agreement of the observed and 
predicted activities, and reposes the inactive molecules to be in the best position to refute 
the model. Thus, in order for the model to pass the above-described testing process, it must 
predict the inactivity of poses of inactive molecules even though these have been realigned 
and reconformed to be in the best position to "fool" the model, while at the same time 
confirming the activity of the active molecules. 

Gradient search methods are also used for reposing the training molecules to 
maximize their predicted activities as functions of the orientation and conformational 
parameters. 

If the extracted features are differentiable functions of the orientation and 
conformational parameters and the model (as represented by the equations above) is a 
differentiable function of the values of the extracted features, the chain rule may be applied 
to compute the gradient of the predicted activity with respect to the orientation and 
conformational parameters, and gradient-based search may be applied to find poses that 
maximize predicted activity. However, other kinds of models and other methods of feature 
extraction may not satisfy this property, in which case other computational methods (e.g., 
simulated annealing, linear programming) could be applied to find poses that maximize 
predicted activity. It is understood that the scope of the invention includes all methods for 
finding such poses. 
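The chain-rule computation can be sketched for a single conformational parameter. Everything specific here is hypothetical: the feature 2 + cos(theta) standing in for a feature that depends on one torsion angle, the model Sigmoid(4 * G(f) - 2), and the parameter values.

```python
import math

MU, SIGMA = 2.5, 0.5  # assumed location and width parameters


def feature(theta):
    """Hypothetical differentiable feature of one torsion angle theta."""
    return 2.0 + math.cos(theta)


def activity(f):
    """Assumed model: Sigmoid(4 * G(f) - 2) with a Gaussian-like G."""
    g = math.exp(-((f - MU) / SIGMA) ** 2)
    return 1.0 / (1.0 + math.exp(-(4.0 * g - 2.0)))


def activity_gradient(theta):
    """d(activity)/d(theta) by the chain rule:
    d(activity)/d(feature) * d(feature)/d(theta)."""
    f = feature(theta)
    g = math.exp(-((f - MU) / SIGMA) ** 2)
    s = activity(f)
    dg_df = g * (-2.0 * (f - MU) / SIGMA ** 2)
    ds_df = s * (1.0 - s) * 4.0 * dg_df
    df_dtheta = -math.sin(theta)
    return ds_df * df_dtheta


def repose(theta, rate=0.1, steps=500):
    """Gradient ascent on the predicted activity with respect to theta."""
    for _ in range(steps):
        theta += rate * activity_gradient(theta)
    return theta


start = 0.2
best = repose(start)
```

Starting from the made-up angle 0.2, the search moves the torsion angle until the feature value reaches the location parameter, where the assumed model predicts the highest activity. A real pose has many orientation and torsion parameters, and the same product of derivatives would be taken for each.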

Instead of reposing the molecules, it is possible to simply re-select the best poses 
from the original set of poses formed prior to the selection step. Reposing the molecules 
rather than re-selecting from existing poses may reduce the error of prediction. 

The trained model and the ultimate parameter values may then be used to predict the 
activity of a new molecule with unknown activity. Thus, again, feature values are extracted 
from the poses of the molecule and the predicted activities of the poses are calculated to 
find the best pose with the highest activity. Thus, the model not only enables the user to 
predict the activity of a molecule not in the training set but also to predict its best pose. Its 
feature values in comparison with the parameter values would indicate which surface 
portions have the desirable properties in regard to a chemical function and which surface 
portions have undesirable properties in regard to such function. In fact, the model may be 
used to search a database of molecules with unknown activity and predict the activities of 
their poses. Poses of these molecules may be modified to alter their predicted activities. 



5.6 Factor Analysis 

Factor analysis provides a technique for expressing the behavior of a system of data 
in terms of orthogonal vectors, which form a basis set to describe the system. 
Advantageously, a model is not required to determine the variables which are important in 
describing the behavior of the system of data. Thus, for example, such an approach would 
allow one to discriminate from among various shape features those features which are 
important in determining a desired activity of a molecule. Factor analysis techniques are 
known in the art and are described, for example, in Lawton and Sylvestre, Technometrics, 
vol. 13, 1971, pp. 617-633, and Edmund R. Malinowski, Factor Analysis in Chemistry, 2nd 
Edition, John Wiley & Sons, New York, 1993, which are incorporated in their entireties by 
reference. 

In a factor analysis approach, variables describing members of a targetshape group 
centered around one or more target compounds may be arranged in a matrix of data. Input 
variables may include, but are not limited to, structural features determined from 
spectroscopic or other shape determining techniques as described above, the presence of 
one or more molecular epitopes, the capacity to bind to an antigen, other compound, or 
surface, the ability to displace another molecule from a binding site, the ability to catalyze 
one or more reactions, or the ability to modify the catalytic activity of another molecule. 

Factor analysis coupled with self-modeling curve resolution allows the significant 
components responsible for variation in data to be identified and extracted in lieu of a 
model describing their behavior. 

Initially a matrix D is created. The rows of D contain conformation and/or activity 
data for molecules in the training set. Each column of D, therefore, corresponds to 
particular conformation or activity data such as, for example, a binding efficiency, an 
ability to modify an activity of a receptor molecule, a molecular radius, an ability to catalyze 
a chemical reaction, a bond length, a bond angle, or a relative or absolute distance between 
predetermined nuclei in each molecule. One of ordinary skill in the art understands that a 
variety of conformational and activity data can be used to describe a molecule and each is 
applicable with the present invention. 

To find the correlated variation among the conformation or activity data and to 
identify the parameters that are best able to describe the activity of the molecules, a 
covariance matrix Z is formed from the matrix D: 

$$Z = D^{T} D$$ 

where Z is a j by j square matrix whose rows describe the correlation between the columns 
of D. The covariance matrix may be diagonalized by finding a matrix Q such that 

$$ZQ = Q\lambda$$ 

where Q is a j by j matrix of eigenvectors and $\lambda$ is a diagonal matrix of eigenvalues. The 
eigenvectors in Q are abstract representations of the variation across the columns of D. The 
magnitude of the jth eigenvalue, $\lambda_j$, indicates the amount of variation in the data described 
by the jth eigenvector. 
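A minimal sketch of this decomposition, assuming NumPy is available, is given below. The data matrix is made up, no mean-centering is applied (following the formula for Z as written), and `numpy.linalg.eigh` is simply one convenient routine for symmetric matrices rather than a method named in the specification.

```python
import numpy as np

# Made-up data matrix D: 5 molecules (rows) by 3 descriptors (columns).
D = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 5.9, 0.6],
    [4.0, 8.0, 0.5],
    [5.0, 10.1, 0.4],
])

# j-by-j matrix formed from the columns of D.
Z = D.T @ D

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, Q = np.linalg.eigh(Z)

# Reorder so the eigenvector describing the most variation comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
Q = Q[:, order]
```

Because the second column of this example is nearly twice the first, most of the variation is captured by the first eigenvector, and the remaining eigenvalues are comparatively small.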

If the variation in the data originates only from the n distinguishable components, 
then the rank of Z (in the absence of noise) would be n, and linear combinations of the n 
eigenvectors with nonzero eigenvalues describe all of the variation across the columns in D. 
However, due to random experimental error, j - n additional eigenvectors arise from 
decomposition of Z. These eigenvectors are dominated by experimental noise and their 
removal does not significantly impact analysis of the correlated behavior of the data. 
Furthermore, since the number of components needed to describe the behavior of the 
molecules in the data is not generally known a priori, determining n can be a significant 
step toward identifying molecular conformational parameters and activities best suited for 
identifying promising drug leads. Both the relative magnitude of the eigenvalues and the 
shape of the eigenvectors may be used to estimate the number of components. 

A number of approaches that rely on the relative magnitudes of the eigenvalues 
have been developed for estimating n when the experimental variance is unknown. The 
method of reduced eigenvalues, REV, was proposed in 1987 by Malinowski. The jth 
reduced eigenvalue is given by 

$$REV_j = \lambda_j / \left( (r - j + 1)(c - j + 1) \right)$$ 
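The reduced eigenvalues can be computed in a few lines. This sketch takes r and c to be the numbers of rows and columns of the data matrix, which is how Malinowski's reduced eigenvalues are usually defined but is not stated in the surviving text; the eigenvalues and matrix dimensions are made up.

```python
def reduced_eigenvalues(eigenvalues, rows, cols):
    """REV_j = lambda_j / ((rows - j + 1) * (cols - j + 1)) for j = 1, 2, ...
    The eigenvalues are expected in decreasing order."""
    return [lam / ((rows - j + 1) * (cols - j + 1))
            for j, lam in enumerate(eigenvalues, start=1)]


# Made-up eigenvalues for a 10-row, 4-column data matrix.
eigenvalues = [500.0, 120.0, 0.9, 0.7]
rev = reduced_eigenvalues(eigenvalues, rows=10, cols=4)

# Ratios of successive reduced eigenvalues; a large ratio suggests the point
# at which the eigenvalues drop from signal to noise.
ratios = [rev[j] / rev[j + 1] for j in range(len(rev) - 1)]
```

In this example the largest ratio falls between the second and third reduced eigenvalues, which would suggest two significant components.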

The reduced eigenvalue ratio $REV_j / REV_{j+1}$ may [...] the data exceeding the noise. [...] 
5.7 Principal Components Analysis 

As understood in the art, principal [...] approach to modeling empirical data, which can 
be utilized with the present invention. [...] presentation of the data but differs in that it 
also involves a regression [...] Principal Component Analysis, Springer-Verlag, New York, 
1986. Richard A. Reyment, et al., Applied Factor Analysis in the Natural Sciences, 
Cambridge Univ. Press, New York, 1996, describe PCA approaches to data modeling and are 
included in their entireties by reference. 

5.8 Computer Systems 

FIG. 2 discloses a representative computer system 810 in conjunction with which 
the embodiments of the present invention may be implemented. Computer system 810 may 
be a personal computer, workstation, or a larger system such as a minicomputer. However, 
one skilled in the art of computer systems will understand that the present invention is not 
limited to a particular class or model of computer. 

As shown in FIG. 3, representative computer system 810 includes a central processing 
unit (CPU) 812, a memory unit 814, one or more storage devices 816, an input device 818, an 
output device 820, and communication interface 822. A system bus 824 is provided for 
communications between these elements. Computer system 810 may additionally function 
through use of an operating system such as Windows, DOS, or UNIX. However, one skilled 
in the art of computer systems will understand that the present invention is not limited to a 
particular operating system. 

Storage devices 816 may illustratively include one or more floppy or hard disk drives, 
CD-ROMs, DVDs, or tapes. Input device 818 comprises a keyboard, mouse, microphone, or 
other similar device. Output device 820 is a computer monitor or any other known computer 
[...] 
What is claimed is: 

1. A method for predicting a property of a molecule comprising the steps of: 

obtaining an initial odd set of molecules that bind at least one molecule 
belonging to an origin set of molecules; 

obtaining an even set of molecules that bind at least one molecule belonging 
to said odd set of molecules; 

selecting a training set comprising a subset of the even set of molecules; 

determining a conformation for each of the training set molecules; 

constructing a model for predicting a predetermined property of at least one 
new molecule not assigned to the subset of the even set of molecules and wherein the new 
molecule has a new conformation and the model includes the conformation of at least some 
of the training set molecules; and 

predicting the predetermined property of the new molecule. 

2. The method according to claim 1 comprising the further steps of: 

selecting a training set comprising a subset from the odd set of molecules; 
and 

repeating the determining, constructing, and predicting steps wherein the 
model further comprises the conformation for each molecule in the odd subset. 

3. The method of claim 1 wherein the constructing a model step further comprises the 
steps of: 

predicting a predetermined property of at least one molecule assigned to one 
of the subsets; 

conditionally modifying the model in response to a difference between said 
predicted predetermined property and an empirical estimate of said predicted property; and 

repeating said predicting and conditionally modifying steps until said 
difference reaches a predetermined value. 

4. The method of claim 1 wherein the predetermined property is the ability to bind to 
at least one predetermined molecule and the empirical estimate is determined from a 
binding assay. 

5. The method of claim 1 wherein the model comprises at least one of a neural 
network, a factor analysis, or a principal components analysis. 
6. The method of claim 5 wherein the model is a neural network comprising: 

a plurality of layers each having at least one node wherein the plurality of 
layers include a first layer having at least one node coupled to an input value and a second 
layer having at least one node coupled to a plurality of nodes of said first layer, and 

the first layer having at least one node with a first transfer function and the 
second layer having at least one node with a second transfer function. 

7. The method of claim 5 wherein the conformation is determined by at least one of 
x-ray crystallography, nuclear magnetic resonance, or molecular modeling. 

8. The method of claim 1 wherein the conformation comprises at least one of an 
absolute position of atomic nuclei in each molecule, a relative position of atomic nuclei in 
each molecule, an electron density distribution, a bond angle, a bond length, or a van der 
Waals radius of atoms in the molecule. 

9. The method of claim 1 further comprising the step of searching a conformational 
data base for molecules having a conformation similar to the new conformation. 

10. The method of claim 1 further comprising the step of synthesizing at least a portion 
of the new molecule. 

11. The method of claim 9 wherein the new molecule comprises at least one of DNA, 
RNA, a peptide, a polypeptide, or a small molecule. 
12. The method of claim 10 further comprising the steps of: 

producing one or more first variants of the new molecule that are at least 
somewhat similar to the new molecule; and 

selecting one or more of the first variants having at least one desired 
characteristic. 

13. The method according to claim 12 wherein the first variants comprise a stochastic 
sequence of polynucleotides. 

14. The method according to claim 10 further comprising raising antibodies against the 
new molecule. 
15. A method for predicting a property of a molecule comprising the steps of: 

obtaining an initial odd set of molecules that bind at least one molecule 
belonging to an origin set of molecules; 

obtaining an even set of molecules that bind at least one molecule belonging 
to said odd set of molecules; 

obtaining an odd set of molecules that bind at least one molecule belonging 
to said even set of molecules; 

repeating said obtaining an odd set of molecules and said obtaining an even 
set of molecules steps to generate a sequence of odd and even sets of molecules wherein the 
molecules in each of said sets bind to at least one of the molecules in a preceding one of the 
sets in the sequence; and 

selecting a training set comprising an even subset from each of at least two 
even sets of molecules; 

determining a conformation for each molecule in each of said subsets; 

constructing a model for predicting a predetermined property of at least one 
new molecule not assigned to the subsets of molecules wherein the model comprises the 
conformation of at least some of the molecules from each even subset; and 

predicting a predetermined property of the new molecule. 

16. The method of claim 15 further comprising the steps of: 

selecting a training set comprising a subset from each of at least two odd 
sets of molecules; and 

repeating the determining, constructing, and predicting steps wherein the 
model further comprises the conformation for each molecule in each of the odd subsets. 

17. The method of claim 16 wherein the constructing a model step further comprises 
the steps of: 

predicting a predetermined property of at least one molecule assigned to one 
of the subsets; 

conditionally modifying the model in response to a difference between said 
predicted predetermined property and an empirical estimate of said predicted property; and 

repeating said predicting and conditionally modifying steps until said 
difference reaches a predetermined value. 

18. The method of claim 17 wherein the predetermined property is the ability to bind to 
at least one predetermined molecule. 

19. The method of claim 18 wherein the model comprises at least one of a neural 
network, a factor analysis model, a principal components analysis model, or an independent 
component analysis model. 

20. A method for predicting a property of a molecule comprising the steps of: 

selecting a first origin set of molecules; 

obtaining an initial odd set of molecules that binds at least one molecule 
belonging to the first origin set of molecules; 

obtaining an even set of molecules that bind at least one molecule belonging 
to said odd set of molecules; 

obtaining an odd set of molecules that bind at least one molecule belonging 
to said even set of molecules; 

repeating said obtaining an odd set of molecules and said obtaining an even 
set of molecules steps to generate a sequence of odd and even sets of molecules wherein the 
molecules in each of said sets bind to at least one of the molecules in a preceding one of the 
sets in the sequence; 

selecting a second origin set of molecules and repeating said obtaining an 
initial odd set, obtaining an even set, obtaining an odd set and said repeating steps to 
generate a second sequence of odd and even sets of molecules; 

selecting a training set comprising an even subset from each of at least two 
even sets of molecules belonging to the first and second sequences; 

determining a conformation for each molecule in each of said subsets; 

constructing a model for predicting a predetermined property of at least one 
new molecule not assigned to the subsets of molecules wherein the model comprises the 
conformation of at least some of the molecules from each even subset; and 

predicting a predetermined property of the new molecule. 


21. The method of claim 20 wherein the predetermined property comprises the ability 
of the new molecule to bind to each of at least two predetermined molecules. 